Co-evolutionary multi-task learning for dynamic time series prediction

Rohitash Chandraa,1, Yew-Soon Ongc, Chi-Keong Gohc

a Centre for Translational Data Science, The University of Sydney, Sydney, NSW 2006, Australia b School of Geosciences, The University of Sydney, Sydney, NSW 2006, Australia c Rolls Royce @NTU Corp Lab, Nanyang Technological University, 42 Nanyang View, Singapore

Abstract Time series prediction typically consists of a data reconstruction phase where the time series is broken into overlapping windows known as the timespan. The size of the timespan can be seen as a way of determining the extent of past information required for an effective prediction. In certain applications such as the prediction of wind-intensity of storms and cyclones, prediction models need to be dynamic in accommodating different values of the timespan. These applications require robust prediction as soon as the event takes place. We identify a new category of problem called dynamic time series prediction that requires a model to give prediction when presented with varying lengths of the timespan. In this paper, we propose a co-evolutionary multi-task learning method that provides a synergy between multi-task learning and co-evolutionary algorithms to address dynamic time series prediction. The method features effective use of building blocks of knowledge inspired by dynamic programming and multi-task learning. It enables neural networks to retain modularity during training for making a decision in situations even when certain inputs are missing. The effectiveness of the method is demonstrated using one-step-ahead chaotic time series and tropical cyclone wind-intensity prediction. Keywords: Coevolution; multi-task learning; modular neural networks; chaotic time series; and dynamic programming.

1. Introduction the same timespan which makes dynamic time series prediction a challenging problem. Time series prediction problems can Time series prediction typically involves a pre-processing be generally characterised into three major types of problems stage where the original time series is reconstructed into a state- that include one-step [3, 2, 7], multi-step-ahead [17, 18, 19], space representation that is used as dataset for training models and multi-variate time series prediction [20, 21, 22]. These such as neural networks [1, 2, 3, 4, 5, 6, 7, 8]. The reconstruc- problems at times may overlap with each other, for instance, tion involves breaking the time series using overlapping win- a multi-step-ahead prediction can have a multi-variate compo- dows known as timespan taken at regular intervals which de- nent. Similarly, a one-step prediction can also have a multi- fines the time lag [9]. The optimal values for timespan and time variate component, or a one-step ahead prediction can be used lag are needed for effective prediction. These values vary on for multi-step prediction and vice-versa. In this paper, we iden- the type of problem and require costly computational evalua- tify a special class of problems that require dynamic prediction tion for model selection; hence, some effort has been made to with the hope that the trained model can be useful for different address this issue. Multi-objective and competitive coevolution instances of the problem. methods have been used to take advantage of different features Multi-task learning employs shared representation knowl- from the timespan during training [10, 11]. Moreover, neural edge for learning multiple instances from the same problem network have been used for determining optimal timespan of with the goal to develop models with improved performance in selected time series problems [12].

arXiv:1703.01887v2 [cs.NE] 13 Jun 2018 decision making [23, 24, 25, 26]. We note that different values In time series for natural disasters such as cyclones [13, 14, in the timespan can be used to generate several distinct datasets 15], it is important to develop models that can make predictions that have overlapping features which can be used to train mod- dynamically, i.e. the model has the ability to make a prediction ules for shared knowledge representation as needed for multi- as soon as any observation or data is available. The minimal task learning. Hence, it is important to ensure that modularity value for the timespan can have huge impact for the case of is retained in such a way so that decision making can take place cyclones, where data is only available every 6 hours [16]. A even when certain inputs are missing. Modular neural networks way to address such categories of problems is to devise robust have been motivated from repeating structures in nature and ap- training algorithms and models that are capable of performing plied for visual recognition tasks [27]. has been given different types of input or subtasks. We define dynamic used to optimise performance and connection costs in modular time series prediction as a problem that requires dynamic pre- neural networks [28] which also has the potential of learning diction given a set of input features that vary in size. It has been new tasks without forgetting old ones [29]. The features of highlighted in recent work [16] that recurrent neural networks modular learning provide motivation to be incorporated with trained with a predefined timespan can only generalise well for

Preprint submitted to June 14, 2018 multi-task learning for dynamic time series prediction. of the co-evolutionary multi-task learning method for dynamic In dynamic programming, a large problem is broken down time series prediction. Section 4 presents the results with dis- into sub-problems, from which at least one sub-problem is used cussion and Section 5 presents the conclusions and directions as a building block for the optimisation problem. Although dy- for future research. namic programming has been primarily used for optimisation problems, it has been briefly explored for data driven learning 2. Background and Related Work [30] [31]. The notion of using sub-problems as building block in dynamic programming can be used in developing algorithms 2.1. Multi-task learning and applications for multi-task learning. Cooperative coevolution (CC) is a di- A number of approaches have been presented that consid- vide and conquer approach that divides a problem into subcom- ers multi-task learning [23] for different types of problems that ponents that are implemented as sub-populations [32]. CC has include supervised and unsupervised learning [38, 39, 40, 41]. been effective for learning difficult problems using neural net- The major approach to address negative transfer for multi-task works [33]. Potter and De Jong demonstrated that CC provides learning has been through task grouping where knowledge trans- more diverse solutions through the sub-populations when com- fer is performed only within each group [42, 43]. Bakker et al. pared to conventional evolutionary algorithms [33]. CC has for instance, presented a been very effective for training recurrent neural networks for Bayesian approach in which some of the model parameters time series prediction problems [7, 8]. were shared and others loosely connected through a joint prior Although multi-task learning has mainly been used for ma- distribution learnt from the data [43]. Zhang and Yeung pre- chine learning problems, the concept of shared knowledge rep- sented a convex formulation for multi-task metric learning by resentation has motivated other domains. In the optimisation modeling the task relationships in the form of a task covariance literature, multi-task evolutionary algorithms have been pro- matrix [42]. Moreover, Zhong et al. presented flexible multi- posed for exploring and exploiting common knowledge between task learning framework to identify latent grouping structures the tasks and enabling transfer of knowledge between them in order to restrict negative knowledge transfer [44]. Multi- for optimisation [34, 35]. It was demonstrated that knowledge task learning has recently contributed to a number of successful from related tasks can help in speeding up the optimisation pro- real-world applications that gained better performance by ex- cess and obtain better quality solutions when compared to con- ploiting shared knowledge for multi-task formulation. Some of ventional (single-task optimisation) approaches. Evolutionary these applications include 1) multi-task approach for “ retweet” multi-task learning has been used for efficiently training feed- prediction behaviour of individual users [45], 2) recognition of forward neural networks for n-bit parity problem [36], where facial action units [22], 3) automated Human Epithelial Type 2 different subtasks were implemented as different topologies that (HEp-2) cell classification [46], 4) kin-relationship verification obtained improved training performance. In the literature, syn- using visual features [47] and 5) object tracking [48]. ergy of dynamic programming, multi-task learning and neu- roevolution has not been explored. Ensemble learning meth- 2.2. Cooperative Neuro-evolution ods would be able to address dynamic time series to an extent, where an ensemble is defined by the timespan of the time se- Neuro-evolution employs evolutionary algorithms for train- ries. Howsoever, it would not have the feature of shared knowl- ing neural networks [49] which can be classified into direct edge representation that is provided through multi-task learn- [49, 50] and indirect encoding strategies [51]. In direct en- ing. Moreover, there is a need for a unified model for dynamic coding, every connection and neuron is specified directly and times series problems due to problems that require dynamic explicitly in the genotype [49, 50]. In indirect encoding, the prediction. genotype specifies rules or some other structure for generating In this paper, we propose a co-evolutionary multi-tasking the network [51]. Performance of direct and indirect encodings method that provides a synergy between multi-task learning, varies for specific problems. Indirect encodings seem very intu- dynamic programming and coevolutionary algorithms. The method itive and have biological motivations, however, in several cases enables neural networks to be trained by featuring shared and they have shown not to outperform direct encoding strategies modular knowledge representation in order to make predictions [52, 53]. given limited input features. This enables the learning pro- Cooperative coevolution for training neural networks is known cess to employ modules of knowledge from the related sub- as cooperative neuroevolution [33, 54, 55]. Although coopera- tasks as building blocks of knowledge for a unified model. The tive coevolution faced challenges in problem decomposition, it proposed method is used for one-step-ahead chaotic time se- showed promising features that included modularity and diver- ries problems using feedforward neural networks for bench- sity [33]. Further challenges have been in area of credit assign- mark problems. The method is also used for tropical cyclone ment for subcomponents [33, 54], problem decomposition, and wind-intensity prediction and addresses the problem of mini- adaptation due to problem of separability that refer to grouping mal timespan where dynamic prediction is required. The paper interacting or highly correlated variables [55]. In cooperative extends results presented in [37]. neuro-evolution, problem decomposition has a major effect in The rest of the paper is organised as follows. Section 2 the training and generalisation performance. Although several gives a background on multi-task learning, cooperative neuro- decomposition strategies have been implemented that vary for evolution, and time series prediction. Section 3 gives details 2 different network architectures, the two established decompo- problems [61], it has been well used in the areas of machine sition methods are those on the synapse [52] and neuron level learning (such as spoken word recognition [62]), and computer [56, 55, 57]. In synapse level, the network is decomposed to its vision (such as variational problems [63]). lowest level where each weight connection (synapse) forms a Reinforcement learning on the other hand, considers agents subcomponent [52, 6]. In neuron level, the neurons in the net- that take actions in an environment to maximise the notion of work act as the reference point for each subcomponent [58, 57]. cumulative reward [64]. Reinforcement learning has a wide They have shown good performance in pattern classification range of multi-disciplinary applications such as game theory, problems [59, 56, 57]. Synapse level decomposition has shown control theory, and operations research [65]. In the operations good performance in control and time series prediction prob- research and control literature, reinforcement learning is called lems [52, 6, 7], however, they gave poor performance for pat- approximate dynamic programming [66], or neuro-dynamic pro- tern classification problems [55]. Chandra et al. applied neural gramming [67]. They combine ideas from the fields of neu- and synapse level decomposition for chaotic time series prob- ral networks, cognitive science and approximation theory. In lems using recurrent neural networks [7]. Hence, it was es- , the environment is typically formulated as a tablished that synapse level encoding was more effective for Markov decision process (MDP) where the outcomes are partly time series and control problems [52, 7]. Chandra later pre- random and partly under the control of a decision maker. Op- sented competition and collaboration with neuron and synapse posed to classical dynamic programming, reinforcement learn- decomposition strategies during evolution which improved the ing does not assume knowledge of an exact mathematical model performance further [8]. of the MDP. Recent application via deep learning considers learning of policies directly from high-dimensional sensory in- Alg. 1 Cooperative neuroevolution puts in the challenging domain of classic Atari 2600 games Step 1: Decompose the problem (neuron or synapse level decom- [68]. position) Evolutionary algorithms have been proposed as a method Step 2: Initialise and cooperatively evaluate each sub-population for reinforcement learning [69]. Reinforcement learning via for each cycle until termination do evolutionary algorithms have been implemented as neuroevo- for each Sub-population do lution [70] for neural networks with application for playing for n Generations do the game of ’Go’. Reinforcement learning more recently has i) Select and create new offspring been implemented with co-evolutionary algorithms in a clas- ii) Cooperatively evaluate the new offspring sical control problem that considers balancing double inverted iii) Update sub-population poles [52]. This further motivates the methodology presented end for end for in this paper that provides a synergy between, dynamic pro- end for gramming, reinforcement learning and neuroevolution for co- evolutionary multi-task learning.

In Algorithm 1, the network is decomposed according to the 2.4. Machine learning and optimisation selected decomposition method. Neuron level decomposition is Essentially, machine learning algorithms have three compo- shown in Figure 1. Once the decomposition is done, the sub- nents that include representation, evaluation and optimisation. components that are implemented as sub-populations are ini- Representation is done in the initial stage when the problem tialized and evolved in a round-robin fashion, typically for a is defined and form formulated. Representation considers the fixed depth of search given by generations. The evaluation of type of problem (classification, regression or prediction) and the fitness of each individual for a particular sub-population is the evaluation metrics such as squared error loss and classifi- done cooperatively by concatenating the current individual with cation performance. Representation also considers initialisa- the fittest individuals from the rest of the sub-populations [33]. tion of the parameters, such as the weights of the neural net- The concatenated individual is then encoded into the neural net- works and hyper parameters such as the learning rate. In the work where its fitness is evaluated and returned. Although it is case of cooperative neuroevolution, the representation compo- a representative fitness, the fitness of the entire network is as- nent would consider the encoding of the network weights into signed to the particular individual of the sub-population. This the subpopulations and initialising them for evolution. Eval- is further illustrated in Figure 1. uation and optimisation are components that iterate over time until a certain condition is met. Machine learning can also 2.3. Dynamic programming and reinforcement learning be seen as a data driven optimisation process. The learning Dynamic programming (also known as dynamic optimisa- procedure can be seen as solving a core optimisation problem tion) is a optimisation strategy that considers breaking a large that optimises the variables or parameters of the model with re- problem into sub-problems and using their solution as a build- spect to the given loss function. Evolutionary algorithms are ing block to solve the bigger problem [60]. By simply using typically considered as optimisation methods and their synergy previously computed solution taken from a sub-problem, the with neural networks into neuroevolution can be viewed as a paradigm improves the computation time and also becomes ef- learning procedure. In this paper, learning in neural networks ficient in memory or storage. Although dynamic programming is implemented using co-evolutionary algorithms that features has typically been an approach for optimisation and sequential elements from multi-task learning and dynamic programming. 3 Neuron Level Problem Decomposition Encoding

1 Sub-populations (SP) Output of Cooperative Neuro-Evolution layer

3 2

SP (1) SP (2) SP (3) Evolution Evolution Evolution

Hidden layer Input layer Feedforward Neural Network

Figure 1: Feedforward network with Neuron level decomposition. Note that 4 input neurons represent time series reconstruction with timespan of 4.

Moreover, learning is also referred to evolution in the context MSA as the prediction horizon becomes larger, multi-variate in- of neuroevolution. Bennett and Parrado-Hernndez in an intro- formation becomes more important [80]. Another area of prob- ductory note to a special issue of a journal mentioned that op- lems in time series prediction consist of applications that have timisation problems lie at the heart of most machine learning missing data. Wu et al. approached the missing data problem approaches [71]. They highlighted the need for dealing with in time series with non-linear filters and neural networks [81]. uncertainty, convex models, hyper-parameters, and hybrid ap- In their method, a sequence of independent Bernoulli random proaches of optimisation methods for learning. Furthermore, variables were used to model random interruptions which was Guillory et al. showed that online active learning algorithms later used to construct the state-space vector in pre-processing can be viewed as stochastic gradient descent on non-convex ob- stage. jective functions [72]. Furthermore, novel approaches that feature a synergy of dif- ferent methodologies have recently been presented to address 2.5. Problems in time series prediction time series prediction. Extreme value analysis considers the ex- Although a number of methods have been used for one- treme deviations from the median of probability distributions, step ahead prediction, neural networks have given promising which has been beneficial for time series prediction in the past results with different architectures [2, 7] and algorithms that in- [82]. D’Urso et al. explored the grouping of time series with clude gradient-based learning [73, 1], evolutionary algorithms similar seasonal patterns using extreme value analysis with fuzzy [6, 7, 8], and hybrid learning methods [3, 4, 2]. These meth- clustering with an application daily sea-level time series in Aus- ods can also be used for multi-step ahead and multivariate time tralia [83]. Chouikhi et al. presented echo state networks for series prediction. Multi-step-ahead (MSA) prediction refers to time series prediction where particle swarm optimised was used the forecasting or prediction of a sequence of future values from to optimise the untrained weights that gave enhancement to observed trend in a time series [74]. It is challenging to de- learning [84]. Such approaches give motivations for developing velop models that produce low prediction error as the predic- a synergy of different methods in order to utilise their strengths tion horizon increases [17, 18, 19]. MSA prediction has been and eliminate their weaknesses. approached mostly with the recursive and direct strategies. In the recursive strategy, the prediction from a one-step-ahead pre- 3. Co-evolutionary Multi-task Learning diction model is used as input for future prediction horizon [75, 76]. Although relatively new, a third strategy is a com- 3.1. Preliminaries: time series reconstruction bination of these approaches [75, 77]. State-space reconstruction considers the use of Taken’s the- Multi-variate time series prediction typically involves the orem which expresses that the state-space vector reproduces prediction of single or multiple values from multi-variate in- important characteristics of the original time series [9]. Given put that are typically interconnected through some event [20, an observed time series x(t), an embedded state space Y(t) = 21, 22]. Examples of single value prediction are the prediction [(x(t), x(t − T), ..., x(t − (D − 1)T)] can be generated, where, T of flour prices of time series obtained from different cities [20] is the time delay, D is the timespan (also known as embedding and traffic time series [78]. The goal in this case is to enhance dimension), t = (D − 1)T, DT, ..., N − 1, and N is the length of the prediction performance from the additional features in the the original time series. The optimal values for D and T must input, although the problem can be solved in a univariate ap- be chosen in order to efficiently apply Taken’s theorem [85]. proach [78]. In the case of prediction of multiple values, the Taken’s proved that if the original attractor is of dimension d, model needs to predict future values of the different features, then D = 2d + 1 will be sufficient to reconstruct the attractor for example, prediction of latitude and longitude that defines [9]. In the case of using feedforward neural networks, D is the the movement of cyclones [79]. A recent study has shown that number of input neurons. multivariate prediction would perform better than univariate for

4 3.2. Dynamic time series prediction as cooperative coevolution, however, the major difference lies Natural disasters such as torrential rainfall, cyclones, tor- in the way the solutions of the subcomponents are combined to nadoes, wave surges and droughts [86, 87, 15, 14] require dy- build to the final solution. Hence, the proposed co-evolutionary namic and robust prediction models that can make a decision as multi-task learning algorithm is inspired from the strategies used soon as the event take place. Therefore, if the model is trained in dynamic programming where a subset of the solution is used over specific months for rainy seasons, the system should be as the main building block for the optimisation problem. In this able to make a robust prediction from the beginning of the rainy case, the problem is learning the weights of a cascaded neu- season. We define the event length as the duration of an event ral network architecture where the base problem is the network which can be number of hours of a cyclone or number of days module that is defined by lowest number of input features and of drought or torrential rain. hidden neurons. As noted earlier, in a typical time series prediction prob- The weights in the base network are part of larger cascaded lem, the original time series is reconstructed using Taken’s the- network modules that consist of additional hidden neurons and orem [9, 85]. In the case of cyclones, it is important to mea- input features. This can be viewed as modules of knowledge sure the performance of the model when dynamic prediction is that are combined for larger subtasks that use knowledge from needed regarding track, wind or other characteristics of the cy- smaller subtasks as building blocks. The cascaded network ar- clone [16]. Dynamic prediction can provide early warnings to chitecture can also be viewed as an ensemble of neural networks the community at risk. For instance, data about tropical cyclone that feature distinct topologies in terms of number input and in the South Pacific is recorded at six hour intervals [88]. If the hidden neurons as shown in Figure 2. Suppose that we refer to timespan D = 6, the first prediction by the model at hand would a module in the cascaded ensemble, there are M modules with come after 36 hours which could have devastating effects. input i, hidden h layers as shown. The problem arises when the gap between each data point in the times series is a day or number of hours. The problem I = [i1, i2, ..., iM] with the existing models such as neural networks used for cy- clones is the minimal timespan D needed to make a prediction. H = [h1, h2, ..., hM] (4) It has been reported that recurrent neural networks trained with O = [1, 1, ..., 1] a given timespan (e.g. D = 5), cannot make robust prediction for other timespan ( e.g. D = 7 or D = 3 ) [16]. Therefore, we where I, H, and O contain the set of input, hidden and out- introduce and define the problem of dynamic time series pre- put layers. Note that the approach considers fixed number of diction that refers to the ability of a model to give a prediction output neurons. Since we consider one-step-ahead time series given a set of timespan values rather than a single one. This problem, one neuron in output layer is used for all the respec- enables the model to make decision with minimum value of the tive modules. The input for each of the modules is given by the timespan in cases when rest of features of data-points are not dynamic nature of the problem that considers different lengths available. A conventional one-step ahead time series prediction of timespan that constructs an input vector Ψm for the given can be given by module as follows.

Ψm = x[t], x[t − 1], ..., x[t − Im] (5) x = x[t], x[t − 1], ..., x[t − D] (1) Note that the input-hidden layer ωm weights and the hidden- x[t + 1] = f (x) output layer υm weights are combined for the respective module m. The base knowledge module is given as Φ = [ω , υ ]. The where f (.) is a model such as a feedforward neural network 1 1 1 subtask θ is defined as the problem of training the respective and D is a fixed value for the timespan and x refers to the input m knowledge modules Φ with given input Ψ . Note that Figure 3 features. In the case of dynamic time series, rather than a single m m explicitly shows the knowledge modules of the network for ω value, we consider a set of values for the timespan 2 and ω1, respectively. The knowledge module for each subtask is constructed in a cascaded network architecture as follows. Ωm = [D1, D2, ...DM] (2)

where M is the number of subtasks, given M ≤ D. Hence, Φ1 = [ω1, υ1]; θ1 = (Φ1) the input features for each subtask in dynamic time series pre- Φ2 = [ω2, υ2]; θ2 = [θ1, Φ2] diction can be given by Ψm, where m = 1, 2, ..., M. . .

ΦM = [ωM, υM]; θM = [θM−1, ΦM] Ψm = x[t], x[t − 1], ..., x[t − Ωm] (3) (6) 3.3. Method The vector of knowledge modules considered for training or In the proposed method, a co- based optimisation is therefore Φ = (Φ1,..., ΦM). on a dynamic programming strategy is proposed for multi-task learning. It features problem decomposition in a similar way

5 Alg. 2 Co-evolutionary multi-task learning

Data: Requires input Ψm taken from data Σ for respective subtasks θm. y1 = f (θ1, Ψ1) Result: Prediction error E for dynamic time series y = f (θ , Ψ ) for each module m do 2 2 2 1. Assign fixed depth of search β that defines the number of generation of (7) . evolution for each subcomponent S m; (eg. β = 5 ) . 2. Define the weight space (input - hidden and hidden - output layer) for the different subtasks θm defined by the respective network module yM = f (θM, ΨM) Φm = [ωm, υm]. 3. Create X vector of subtask solutions from Θ . Note that the size of Given T samples of data, the loss L for sample t can be m m Xm depends on number of network module considered in the subtask. calculated by root mean squared error. 4. Initialise genotype of the sub-populations S m j given by j individuals in a range [−α, α] drawn from a uniform distribution where α defines the v tu M range. 1 X end L = (yˆ − y )2 (8) t M m while each phase until termination do m=1 for each subtask m do for each generation until β do wherey ˆ is the observed time series and ym is the prediction for each Individual j in subpopulation Πi, j do given by subtask m. The loss E for the entire dataset (all sub- ** Stage 1: tasks) is given by * Assign Bm as best individual for subtask m from subpop- ulation Πi, j if m == 1 then T 1 X * Get the current individual from subpopulation; V1 = E = L (9) Π where i includes all the variables (weights). As- T t i, j t=1 sign the current individual as X1 = V1 end The training of the cascaded network architecture involves else decomposition as subtasks through co-evolutionary multi-task * Get the current individual from subpopulation; learning (CMTL) algorithm. The knowledge modules in sub- Vm = Πi, j. Append the current subtask solution with best solutions from previous subtasks, Xm = tasks denoted by Φm are implemented as subcomponents S 1, S 2, ..S M, [B1, ..., Bm−1, Vm] where B is the best individual from where M is number of subtasks. The subcomponents are imple- previous subtask and V is the current individual that mented as sub-populations consist of matrix of variables that needs to be evaluated. end feature the weights and biases S m = Πi, j, where i refers to ** Stage 2: weights and biases and j refers to the individuals. The individ- * Execute Algorithm 3: This will encode the Xm into uals of the sub-populations are referred as genotype while the knowledge modules Φ = (Φ1,..., ΦM) for subtasks corresponding network module are referred as the phenotype. f (θm, Ψm) * Calculate subtask output ym = f (θm, Ψm) and evaluate Unlike conventional transfer learning methods, the transfer of 1 PT the given genotype though the loss function E = T t=1 Lt, knowledge here is done implicitly through the sub-populations where Lt is given in Equation 9. in CMTL. The additional subtasks are implemented through the end cascades that utilise knowledge from the base subtask. The fit- for each Individual j in S m do ness of the cascade is evaluated by utilising the knowledge from * Create new offspring via evolutionary operators such the base subtask. This is done through CMTL where the best selection, crossover and mutation solution from the sub-population of the base subtask is con- end catenated with the current individual from the sub-population * Update S m whose fitness needs to be evaluated. This is how transfer of end knowledge is implicitly done through co-evolutionary multi- end end task learning. * Test the obtained solution Algorithm 2 gives details for CMTL which begins by ini- for each module m do tialising the the sub-populations defined by the subtasks which 1. Load best solution S best from S m. feature the knowledge modules Φ and respective subtask input 2. Encode into the respective weight space for the subtask Φm = [ωm, υm]. m 3. Calculate loss E based on the test dataset. features Ψm. The sub-populations are initialised with real val- end ues [−α, α] drawn from uniform distribution where α defines the range. Once this has been done, the algorithm moves into the evolution phase where each subtask is evolved for a fixed subtasks or modules are not required to reach a decision. How- number for generations defined by depth of search, β. The ma- soever, given that the subtask is not a base problem, the current jor concern here is the way the phenotype is mapped into geno- subtask individual Xm is appended with best individuals from type where a group of weight matrices given by Φm = [ωm, υm] the previous subtasks, therefore, Xm = [B1, ..., Bm−1, Vm], where that makes up subtask θm are converted into vector Xm. Stage 1 B is the best individual from previous subtask and V is the cur- in Algorithm 2 implements the use of knowledge from previous rent individual that needs to be evaluated. This will encode Xm subtasks through multi-task learning. In the case if the subtask into knowledge modules Φ = (Φ1,..., ΦM) for the respective is a base problem (m == 1), then the subtask solution Xm is subtasks. The algorithm then calculates subtask network output utilised in a conventional matter where knowledge from other or prediction ym = f (θm, Ψm) and evaluate the individual though 6 1 PT the loss function E = T t=1 Lt, where Lt is given in Equation mapped from different sub-populations defined by the subtasks. 9. The subtask solution is passed to Algorithm 3 along with The algorithm is given input parameters which are the network topology in order to decode the subtask solution into the respective weights of the network. This could be seen 1. The reference to subtask m; as the process of genotype to phenotype mapping. This pro- 2. The current subtask solution; if m is base task, Xm = [Vm] cedure is executed for every individual in the sub-population. else Xm = [B1, ..., Bm−1, Vm], otherwise. This procedure is repeated for every sub-population for differ- 3. The topology of the respective cascaded neural module ent phases until the termination condition is satisfied. The ter- for the different subtasks in terms of number of input, mination condition can be either the maximum number of func- hidden, and output neurons. tion evaluations or a minimum fitness value from the training We describe the algorithm with reference to Figure 3 which or validation dataset. Figure 2 shows an exploded view of the shows a case, where the subtask m = 3 goes through the transfer neural network topologies associated with the respective sub- where m = 1 and m = 2 are used as building blocks of knowl- tasks, however, they are part of the cascaded network architec- edge given in the weights. Therefore, we use examples for the ture later shown in Figure 3. The way the subtask solution is network topology as highlighted below. decomposed and mapped into the network is given in Figure 3 and discussed detail in the next section. 1. Im is vector of number of input neurons for the respective The major difference in the implementation of CMTL when subtasks, eg. I = [2, 3, 4] ; compared to conventional cooperative neuroevolution (Algo- 2. Hm is vector of number of input neurons for the respective rithm 1) is by the way the problem is decomposed and the way subtasks, eg. H=[2, 3, 4]; the fitness for each individual is calculated. CMTL is moti- 3. Om is vector of output neurons for the respective sub- vated by dynamic programming approach where the best so- tasks, eg. O = [1, 1, 1]. lution from previous sub-populations are used for cooperative fitness evaluation for individuals in the current sub-population. The algorithm begins by assigning base case, b = 1 which Howsoever, the current sub-population does not use the best is applied irrespective of the number of subtasks. In Step 1, the solution from future subpopulations. This way, the concept of transfer of Input-Hidden layer weights is shown by weights (1- utilising knowledge from previous subtasks as building blocks 4) in Figure 3. Step 2 executes the transfer for Hidden-Output is implemented . On the other hand, cooperative neuroevolu- layer weights as shown by weights (5-6) in Figure 3. Note that tion follows a divide and conquer approach where at any given Step 1 and 2 are applied for all the cases given by the num- subpopulation, in order to evaluate the individuals, the best in- ber of subtasks. Once this is done, the algorithm terminates if dividuals from the rest of the sub-populations are taken into m = 1 or proceeds if m >= 2. Moving on, in Step 3, the case account. is more complex as we consider m >= 2. Step 1 and 2 are exe- Finally, when the termination criteria has been met, the al- cuted before moving to Step 3 where X contains the appended gorithm moves into the testing phase where the best solutions solution sets from previous subtasks. In Step 3, t in principle from all the different subtasks are saved and encoded into their points to the beginning of the solution given by sub-population respective network topologies. Once this is done, the respec- for m = 2. Here, the transfer for Input-Hidden layer weights tive subtask test data is loaded and the network makes predic- (7-9) is executed for m = 2. Note that in this case, we begin tion that is evaluated with loss E given in Equation 9. Other with the weights with reference to the number of hidden neu- measure of error can also be implemented. Hence, we have rons from previous subtask j = H(m−1) + 1, and move to the highlighted the association of every individual in the respec- number of hidden neurons of the current subtask j = H(m) in or- tive sub-populations with different subtasks in the multi-task der to transfer the weights to all the input neurons. This refers to learning environment. There is transfer of knowledge in terms weights (7-9) in Figure 3. Before reaching transfer for m = 3, of weights from smaller to bigger networks as defined by the m = 1 and m = 2 transfer would have already taken place and subtask with its data which is covered in detail in next sec- hence the weights (13-16) would be transferred as shown in tion. A Matlab implementation of this algorithm with respec- the same figure. Moving on to Step 4, we first consider the tive datasets used for the experiments is given online 1. transfer for Input-Hidden layer weights for m = 2 through the transfer of weights from beginning of previous subtask input, 3.4. Transfer of knowledge i = I(m−1) + 1 to current subtask input connected with all hidden One challenging aspect of the Algorithm 2 is the transfer of neurons. This is given by weights (10-11) in Figure 3. For the knowledge represented by the weights of the respective neural case of m = 3, this would refer to weights (17-19) in the same networks that is learnt by the different subtasks in CMTL. The figure. cascading network architecture increase in terms of the input Finally, in Step 5, the algorithm executes the transfer for and hidden neurons with the subtasks. Algorithm 3 implements Hidden-Output layer weights based on the hidden neurons from transfer of knowledge given the changes of the architecture by previous subtask. In case of m = 2, this results in transferring the different subtasks. The goal is to transfer weights that are weight (12) and for m = 3, the transfer is weight (20) in Figure 3, respectively. Note that the algorithm can transfer any number of input and hidden neurons as the number of subtasks increase. 1 https://github.com/rohitash-chandra/CMTL dynamictimeseries 7 Figure 2: Problem decomposition as subtasks in co-evolutionary multi-task learning. Note that the colours associated with the synapses in the network are linked to their encoding that are given as different subtasks. Subtask 1 employs a network topology with 2 hidden neurons while the rest of the subtasks add extra input and hidden neurons. The exploded view shows that the different neural network topologies are assigned to the different subtasks, however, they are all part of the same network as shown in Figure 3

Figure 3: Transfer of knowledge from subtasks encoding as sub-populations in co-evolutionary multi-task learning algorithm. This diagram shows transfer of knowledge from Subtask 1 and Subtask 2. Once the knowledge is transferred into Subtask 3, the network loads Subtask 3 dataset (4 features in this example) for further evolution of the related sub-population

8 Alg. 3 Transfer of knowledge from previous subtasks 4.1. Benchmark Chaotic Time Series Problems Parameters: Subtask m, module subtask solution X , Input I, Hidden H and In the benchmark chaotic time series problems, the Mackey- Output O Glass, Lorenz, Henon and Rossler are the four synthetic time b = 1; ( Base task) series problems. The experiments use the chaotic time series Step 1 with length of 1000 generated by the respective chaotic attrac- for each j = 1 to Hb do tor. The first 500 samples are used for training and the remain- for each i = 1 to Ib do ing for testing. Wi j = Xt t = t + 1 In all cases, the phase space of the original time series is end reconstructed with the timespan for 3 datasets for the respective end , , Step 2 subtasks with the set of timespan Ω = [3 5 7] and time lag for each k = 1 to Ob do T = 2. All the synthetic and real-world time series were scaled for each j = 1 to Hb do in the range [0,1]. Further details of each of the time series W jk = Xt problem is given as follows. t = t + 1 end The Mackey-Glass time series has been used in literature end as a benchmark problem due to its chaotic nature [89]. The if m >= 2 then Lorenz time series was introduced by Edward Lorenz who has Step 3 for each j = Hm−1 + 1 to Hm do extensively contributed to the establishment of Chaos theory for each i = 1 to Im do [90]. The Henon time series is generated with a Henon map Wi j = Xt which is a discrete-time dynamical system that exhibit chaotic t = t + 1 end behaviour [91] and the Rossler time series is generated using end the attractor for the Rossler system, a system of three non-linear Step 4 ordinary differential equations as given in [92]. The real-world for each j = 1 to Hm -1 do problem are the Sunspot, ACI finance and Laser time series. for each i = Im−1 + 1 to Im do Wi j = Xt The Sunspot time series is a good indication of the solar activi- t = t + 1 ties for solar cycles which impacts Earth’s climate, weather pat- end end terns, satellite and space missions [93]. The Sunspot time series Step 5 from November 1834 to June 2001 is selected which consists of for each k = 1 to Om do 2000 points. The ACI financial time series is obtained from the for each j = Hm−1 + 1 to Hm do ACI Worldwide Inc. which is one of the companies listed on W j,k = Xt t = t + 1 the NASDAQ stock exchange. The data set contains closing end stock prices from December 2006 to February 2010, which is end equivalent to approximately 800 data points. The closing stock end prices were normalised between 0 and 1. The data set features the recession that hit the U.S. market in 2008 [94]. The Laser time series is measured in a physics laboratory experiment that The time complexity of CMTL considers the time taken for were used in the Santa Fe Competition [95]. All the real world transfer of solutions for different number of subtasks. We note time series used the first 50 percent samples for training and that the best case is when the subtask is the base subtask (m = remaining for testing. 1). Therefore, the worst case time complexity can be given by 4.2. Cyclone time series T(m <= 1) = O(1) The Southern Hemisphere tropical cyclone best-track data T(m) = T(m − 1) + T(m − 2), ..., +O(1) (10) from Joint Typhoon Warning Centre recorded every 6-hours is m used as the main source of data [88]. We consider only the T(m) = O(2 ) austral summer tropical cyclone season (November to April) where m refers to the subtasks. from 1985 to 2013 data in the current study as data prior to the satellite era is not reliable due to inconsistencies and missing values. The original data of tropical cyclone wind intensity in 4. Simulation and Analysis the South Pacific was divided into training and testing set as follows: This section presents an experimental study that compares the performance of CMTL with conventional neuroevolution • Training Set: Cyclones from 1985 - 2005 (219 Cyclones methods. The results are compared with neuroevolution via with 6837 data points) evolutionary algorithm (EA) and cooperative neuroevolution (CNE) • for benchmark time series prediction problems. Furthermore, Testing Set: Cyclones from 2006 - 2013 (71 Cyclones tropical cyclones from South Pacific and South Indian Ocean with 2600 data points) are considered to address the minimal timespan issue [16] us- In the case for South Indian Ocean, the details are as fol- ing dynamic time series prediction. lows: 9 • Training Set: Cyclones from 1985 - 2001 ( 285 Cyclones The same trend is shown in general for Lorenz and Henon time with 9365 data points) series as shown in Figure 5 and Figure 6, respectively. There is one exception (D = 5), for the Henon problem where CME • Testing Set: Cyclones from 2002 - 2013 ( 190 Cyclones gives better performance than EA, however, worse than CMTL. with 8295 data points ) Figure 7 shows the results for the Rossler time series which follows a similar trend when compared to the previous prob- Although the cyclones are separate events, we choose to lems. Hence, in general, CMTL generalisation performance is combine all the cyclone data in a consecutive order as given by the best when compared to the conventional methods (CNE and their date of occurrence. The time series is reconstructed with EA) for the 4 synthetic time series problems which have little the set of timespan, Ω = [3, 5, 7], and time lag T = 2. or no noise present. In the case of real-world problems, Figure 8 for the Sunspot 4.3. Experimental Design problem shows that CMTL provides the best generalisation per- We note that multi-task learning approach used for dynamic formance when compared to EA and CNE for all the cases. The time series can be formulated as a series of independent single same is given for first two timespan cases for ACI-finance prob- task learning approaches. Hence, for comparison of CMTL, lem as shown in Figure 9, except for one case (D = 7), where we provide experimentation and results with conventional neu- EA and CMTL gives the same performance. In the case of the roevolution methods that can be considered as single-task learn- Laser time series in Figure 10, which is known as one of the ing (CNE and EA). In the case of CNE, neuron level prob- most chaotic time series problems, CMTL outperforms CNE lem decomposition is applied for training feedforward networks and EA, except for one case, D = 7. Therefore, at this stage, we [55]. We employ covariance matrix adaptation evolution strate- can conclude that CMTL gives the best performance for most gies (CMAES) [96] as the evolutionary algorithm in sub-populationsof the cases in the real-world time series problems. Table 1 of CMTL, CNE and the population of EA. The training and gen- shows the mean of RMSE and confidence interval across the 3 eralisation performances are reported for each case given by the timespan. We find that the CMTL performs better than EA and different subtasks in the respective time series problems. CNE for almost all the problems. The Laser problem is the only The respective neural networks used both sigmoid units in exception where the EA is slightly better than CMTL. the hidden and output layer for all the different problems. The loss function given in Equation 9 is used as the main perfor- Mackey-Glass prediction performance 0.12 mance measure. Each neural network architecture was tested EA-Train CNE-Train with different numbers of hidden neurons. CMNN-Train 0.1 EA-Test We employ fixed depth = 5 generation in the sub-populations CNE-Test CMNN-Test of CMTL as the depth of search as it has given optimal perfor- 0.08 mance in trial runs. CNE also employs the same value. Note that all the sub-populations evolve for the same depth of search. 0.06 The population size of CMAES in the respective methods is given by P = 4 + f loor ∗ (3 ∗ log(W)), where W is the total 0.04 number of weights and biases for the entire neural network that Root Mean Squared Error (RMSE) includes all the subtasks (CMTL). In the case of EA and CNE, 0.02

W is total number of weights and biases for the given network 0 3 5 7 architecture. Subtask The termination condition is fixed at 30 000 function eval- uations for each subtask, hence, CMTL employs 120 000 func- Figure 4: Performance given by EA, CNE, CMTL for Mackey-Glass time series tion evaluations while conventional methods use 30 000 for each of the respective subtasks for all the problems. Note that since there is a fixed training time, there was no validation set 4.5. Results for Tropical Cyclones used to stop training. The choice of the parameters such as the We present the results for the performance of the given meth- appropriate population size, termination condition has been de- ods on the two selected cyclone problems, which features South termined in trial experiments. The experiments are well aligned Pacific and South Indian ocean as shown in Figure 11 and Fig- with experimental setting from previous work [7, 97]. ure 12, respectively. Figures 14 and 13 show the prediction performance of a typical experimental run. In the case of the 4.4. Results for Benchmark Problems South Pacific ocean, the results show that CMTL provides the The results for the 7 benchmark chaotic time series prob- best generalisation performance when compared to CNE and lems are given in Figure 4 to 10 which highlight the training EA. This is also observed for the South Indian ocean. and generalisation performance. We limit our discussion to the generalisation performance, although the training performance 4.6. Discussion is also shown. Figure 4 shows that CMTL generalisation per- The goal of the experiments was to evaluate if CMTL can formance is better than EA and CNE, while CMTL and EA maintain quality in prediction performance when compared to outperform CNE in all the subtasks denoted by the timespan. conventional methods, while at the same time address dynamic 10 Table 1: Combined mean prediction performance for 3 subtasks

Problem EA CNE CMTL Mackey-Glass 0.0564 ±0.0081 0.0859 ±0.0147 0.0472 ±0.0054 Lorenz 0.0444 ±0.0067 0.0650 ± 0.0127 0.0353 ± 0.0049 Henon 0.1612 ± 0.0120 0.1721 ± 0.0128 0.1267 0.0127 Rossler 0.0617± 0.0091 0.0903± 0.0138 0.0489 ±0.0054 Sunspot 0.0529 ± 0.0062 0.0773 ±0.0137 0.0399 ± 0.0052 Lazer 0.0917 ±0.0056 0.1093 ±0.0099 0.0936 ±0.0077 ACI-finance 0.0565 ±0.0091 0.0866 ±0.0159 0.0471 ±0.0087

Lorenz prediction performance Rossler prediction performance 0.09 0.12 EA-Train EA-Train CNE-Train CNE-Train 0.08 CMNN-Train CMNN-Train EA-Test 0.1 EA-Test 0.07 CNE-Test CNE-Test CMNN-Test CMNN-Test 0.06 0.08

0.05 0.06 0.04

0.03 0.04

0.02 Root Mean Squared Error (RMSE) Root Mean Squared Error (RMSE) 0.02 0.01

0 0 3 5 7 3 5 7 Subtask Subtask

Figure 5: Performance given by EA, CNE, CMTL for Lorenz time series Figure 7: Performance given by EA, CNE, CMTL for Rossler time series

Henon prediction performance Sunspot prediction performance 0.25 0.12 EA-Train EA-Train CNE-Train CNE-Train CMNN-Train CMNN-Train EA-Test 0.1 EA-Test 0.2 CNE-Test CNE-Test CMNN-Test CMNN-Test 0.08 0.15

0.06

0.1 0.04

0.05 Root Mean Squared Error (RMSE) Root Mean Squared Error (RMSE) 0.02

0 0 3 5 7 3 5 7 Subtask Subtask

Figure 6: Performance given by EA, CNE, CMTL for Henon time series Figure 8: Performance given by EA, CNE, CMTL for Sunspot time series time series problems. The results have shown that CMTL not [28]. only addresses dynamic time series but is a way to improve the It is noteworthy that CMTL gives consistent performance performance if each of the subtasks in multi-task learning was even for the cases when the problem is harder as in the case of decomposed and approached as single-task learning. The in- smaller or foundational subtasks that have minimal timespan. cremental learning in CMTL not only improves the prediction Learning smaller networks could be harder since they have lim- performance but also ensures modularity in the organisation of ited information about the past behaviour of the time series. knowledge. Modularity is an important attribute for address- The way the algorithm handles this issue is with the refinement ing dynamic prediction problems since groups of knowledge of the solutions (in a round-robin manner through coevolution) can be combined to make a decision when the nature or com- after it has been transferred to a larger network. When more plexity of the problem increases. Modularity is important for information is presented as the subtask increases, CMTL tends design of neural network in hardware [98] as disruptions in cer- to refine the knowledge in the smaller subtasks. In this way, the tain synapse(s) can result in problems with the whole network results show that the performance is consistent given the small which can be eliminated by persevering knowledge as modules subtasks and when they increase depending on the timespan.

11 ACI-Finance prediction performance South-Pacific-Ocean Time Series Performance 0.12 0.07 EA-Train EA-Train CNE-Train CNE-Train CMNN-Train 0.06 CMNN-Train 0.1 EA-Test EA-Test CNE-Test CNE-Test CMNN-Test CMNN-Test 0.05 0.08

0.04 0.06 0.03

0.04 0.02

Root Mean Squared Error (RMSE) 0.02 Root Mean Squared Error (RMSE) 0.01

0 0 3 5 7 3 5 7 Subtask Subtask

Figure 9: Performance given by EA, CNE, CMTL for ACI-finance time series Figure 12: Performance given by EA, CNE, CMTL for South Pacific ocean

Lazer prediction performance 0.14 different subtasks can be different number of features; i.e. the EA-Train CNE-Train algorithm can execute face recognition based on minimal fea- 0.12 CMNN-Train EA-Test CNE-Test tures from a set of features. CMNN-Test 0.1 CMTL could also be viewed as training cascaded networks using a dynamic programming approach where each cascade 0.08 defines a subtask. Although the depth of the cascading does 0.06 not have a limit, adding cascades could result in an exponen- tial increase of training time. The depth of the cascaded ar- 0.04 chitecture would be dependent on the application problem. It Root Mean Squared Error (RMSE) 0.02 depends on the time series under consideration and the level of

0 inter-dependencies of the current and the previous time steps. 3 5 7 In principle, one should stop adding cascades when the pre- Subtask diction performance begins to deteriorate to a given threshold. Figure 10: Performance given by EA, CNE, CMTL for Laser time series Therefore, for a given problem, there needs to be systematic approach that selects the number of subtasks for the cascaded South-Indian-Ocean Time Series Performance architecture. 0.06 EA-Train We have experimentally tested robustness and scalability CNE-Train CMNN-Train using the synthetic and real-world datasets that includes bench- 0.05 EA-Test CNE-Test CMNN-Test mark problems and an application that considers the prediction 0.04 of wind-intensity in tropical cyclones. We provided compre- hensive experimentation for the algorithm convergence given a 0.03 range of initial conditions. These include different set of ini- tialisation of the subpopulations in CMTL with multiple exper- 0.02 imental runs along with further reporting of the mean and con-

Root Mean Squared Error (RMSE) 0.01 fidence interval. We evaluated the prediction capability given different instances of the timespan defined as subtasks in CMTL 0 and compared the performance with standalone methods. The 3 5 7 Subtask experimental design considered multiples experiment runs, dif- ferent and distinct datasets, difference in size of the datasets, Figure 11: Performance given by EA, CNE, CMTL for South Indian ocean and different sets of initialisation in the subpopulations. In this way, we have addressed robustness and convergence of the pro- CMTL can be seen as a flexible method for datasets that posed method, experimentally. have different features of which some have properties so that In the comparison of CMTL with CNE, we observed that they can be grouped together as subtasks. Through multi-task CMTL creates a higher time complexity since it has an addi- learning, the overlapping features can be used as building blocks tional step of transfer of solutions from the different subtasks to learn the nature of the problem through the model at hand. encoded in the subpopulations. The time taken would increase Although feedforward neural networks have been used in CMTL, exponentially as the number of subtasks increases. This would other other neural network architectures and learning models add to cost of utilising solutions from other subtasks given a can be used depending on the nature of the subtasks. In case fixed convergence criteria defined by the number of function of computer vision applications such as face recognition, the evaluations. In case if the convergence criteria is defined by a

12 South-Indian-Ocean Prediction

Actual 0.6 Prediction-ST1 Residual-ST1 Prediction-ST2 Residual-ST2 0.5 Prediction-ST3 Residual-ST3

0.4

0.3

0.2 Prediction (x 180 knots) 0.1

0

-0.1

0 200 400 600 800 1000 Time (x 6 hours)

Figure 13: Typical prediction performance given by different CMTL subtasks (ST1, ST2, and ST3) for the South Indian ocean

South-Pacific-Ocean Prediction

Actual 0.6 Prediction-ST1 Residual-ST1 Prediction-ST2 Residual-ST2 0.5 Prediction-ST3 Residual-ST3

0.4

0.3

0.2 Prediction (x 180 knots) 0.1

0

-0.1

0 200 400 600 800 1000 Time (x 6 hours)

Figure 14: Typical prediction performance given by different CMTL subtasks (ST1, ST2, and ST3) for the South Pacific ocean minimum error or loss, it is likely that the solutions from the tions due to slow convergence of evolutionary algorithms. With previous subtasks will help in faster convergence. In terms of help of gradient-based local search methods, convergence of scalability, we note that neuroevolution methods have limita- CMTL can be improved via memetic algorithms where local

13 refinement occurs during the evolution [99]. There is scope can be seen as features that include cyclone tracks, seas surface in future work convergence proof used in standard evolution- temperature, and humidity. ary algorithms [100] that could further be extended for multiple sub-populations in CMTL. References References 5. Conclusions and Future Work [1] D. Mirikitani and N. Nikolaev, “Recursive bayesian recurrent neural net- We presented a novel algorithm that provides a synergy be- works for time-series modeling,” Neural Networks, IEEE Transactions on, vol. 21, no. 2, pp. 262 –274, Feb. 2010. tween coevolutionary algorithms and multi-tasking for dynamic [2] M. Ardalani-Farsa and S. Zolfaghari, “Chaotic time series prediction time series problems. CMTL can be used to train a model that with residual analysis method using hybrid Elman-NARX neural net- can handle multiple timespan values that defines dynamic input works,” Neurocomputing, vol. 73, no. 13-15, pp. 2540 – 2553, 2010. features which provides dynamic prediction. CMTL has been [3] K. K. Teo, L. Wang, and Z. Lin, “Wavelet packet multi-layer perceptron for chaotic time series prediction: Effects of weight initialization,” in very beneficial for tropical cyclones where timely prediction Proceedings of the International Conference on Computational Science- needs to be made as soon as the event takes place. The results Part II, ser. ICCS ’01, 2001, pp. 310–317. show that CMTL addresses the problem of dynamic time series [4] A. Gholipour, B. N. Araabi, and C. Lucas, “Predicting chaotic time se- and provides a robust way to improve the performance when ries using neural and neurofuzzy models: A comparative study,” Neural Process. Lett., vol. 24, pp. 217–239, 2006. compared to related methods. [5] D. Ruta and B. Gabrys, “Neural network ensembles for time series pre- It is important to understand how CMTL achieved better re- diction,” in 2007 International Joint Conference on Neural Networks, sults for when compared to related methods given the same neu- 2007, pp. 1204–1209. [6] C.-J. Lin, C.-H. Chen, and C.-T. Lin, “A hybrid of cooperative particle ral network topology and data for respective subtasks. CMTL swarm optimization and cultural algorithm for neural fuzzy networks can be seen as an incremental evolutionary learning method and its prediction applications,” Systems, Man, and Cybernetics, Part that features subtasks as building blocks of knowledge. The C: Applications and Reviews, IEEE Transactions on, vol. 39, no. 1, pp. larger subtasks take advantage of knowledge gained from learn- 55–68, Jan. 2009. [7] R. Chandra and M. Zhang, “Cooperative coevolution of Elman recurrent ing the smaller subtasks. Hence, there is diversity in incremen- neural networks for chaotic time series prediction,” Neurocomputing, tal knowledge development from the base subtask which seems vol. 186, pp. 116 – 123, 2012. to be beneficial for future subtasks. However, the reason why [8] R. Chandra, “Competition and collaboration in cooperative coevolution the base subtask produces better results when compared to con- of Elman recurrent neural networks for time-series prediction,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 26, pp. ventional learning can be explored with further analysis during 3123–3136, 2015. learning. The larger subtasks with overlapping features cov- [9] F. Takens, “Detecting strange attractors in turbulence,” in Dynamical ers the base subtask in a cascaded manner. Therefore, larger Systems and Turbulence, Warwick 1980, ser. Lecture Notes in Mathe- subtasks can be seen as those that have additional features that matics, 1981, pp. 366–381. [10] R. Nand and R. Chandra, “Coevolutionary feature selection and re- guide larger network with more hidden neurons during training. construction in neuro-evolution for time series prediction,” in Artificial Finally, CMTL is a novel approach that provides synergy Life and Computational Intelligence - Second Australasian Conference, of a wide range of fundamental methods that include, dynamic ACALCI 2016, Canberra, ACT, Australia, February 2-5, 2016, Proceed- programming, reinforcement learning, multi-task learning, co- ings, 2016, pp. 285–297. [11] S. Chand and R. Chandra, “Multi-objective cooperative coevolution of evolutionary algorithms and neuro-evolution. This makes the neural networks for time series prediction,” in International Joint Con- CMTL useful for some of the applications where the mentioned ference on Neural Networks (IJCNN), Beijing, China, July 2014, pp. fundamental methods have been successful. Since reinforce- 190–197. [12] A. Maus and J. Sprott, “Neural network method for determining embed- ment learning has been utilised in deep learning, the notion of ding dimension of a time series,” Communications in Nonlinear Science reuse of knowledge as building bocks by CMTL could be ap- and Numerical Simulation, vol. 16, no. 8, pp. 3294 – 3302, 2011. plicable in areas of deep learning. In future work, apart from [13] M. M. Ali, P. S. V. Jagadeesh, I. I. Lin, and J. Y. Hsu, “A neural network feed-forward networks, the idea of dynamic time series pre- approach to estimate tropical cyclone heat potential in the indian ocean,” IEEE Geoscience and Remote Sensing Letters, vol. 9, no. 6, pp. 1114– diction that employs transfer and multi-task learning could be 1117, Nov 2012. extended to other other areas that have simpler model repre- [14] L. Zjavka, “Numerical weather prediction revisions using the locally sentation, such as autoregressive models. Furthermore, CMTL trained differential polynomial network,” Expert Systems with Applica- can be used for other problems that can be broken into multi- tions, vol. 44, pp. 265 – 274, 2016. [15] B. W. Stiles, R. E. Danielson, W. L. Poulsen, M. J. Brennan, S. Hristova- ple subtasks, such as multiple step ahead and multivariate time Veleva, T. P. Shen, and A. G. Fore, “Optimized tropical cyclone winds series prediction. Although the paper explored timespan for from quikscat: A neural network approach,” IEEE Transactions on Geo- univariate time series, the approach could be extended to pat- science and Remote Sensing, vol. 52, no. 11, pp. 7418–7434, Nov 2014. tern classification problems that involves large scale features. It [16] R. Deo and R. Chandra, “Identification of minimal timespan prob- lem for recurrent neural networks with application to cyclone wind- can be extended for heterogeneous pattern classification prob- intensity prediction,” in International Joint Conference on Neural Net- lems where the dataset contains samples that have missing val- works (IJCNN), Vancouver, Canada, July 2016, p. In Press. ues or features. CMTL can also be extended for transfer learn- [17] S. B. Taieb and A. F. Atiya, “A bias and variance analysis for multistep- ing problems that can include both heterogeneous and homoge- ahead time series forecasting,” 2015. [18] L.-C. Chang, P.-A. Chen, and F.-J. Chang, “Reinforced two-step-ahead neous domain adaptation cases. In case of tropical cyclones, a weight adjustment technique for online training of recurrent neural net- multivariate approach can be taken where the different subtasks works,” Neural Networks and Learning Systems, IEEE Transactions on, vol. 23, no. 8, pp. 1269–1278, 2012. 14 [19] R. Bone´ and M. Crucianu, “Multi-step-ahead prediction with neural net- 2009, pp. 745–752. [Online]. Available: http://papers.nips.cc/paper/ works: a review,” 9emes rencontres internationales: Approches Connex- 3499-clustered-multi-task-learning-a-convex-formulation.pdf ionnistes en Sciences, vol. 2, pp. 97–106, 2002. [40] J. Chen, L. Tang, J. Liu, and J. Ye, “A convex formulation for learning [20] K. Chakraborty, K. Mehrotra, C. K. Mohan, and S. Ranka, “Forecasting shared structures from multiple tasks,” in Proceedings of the 26th the behavior of multivariate time series using neural networks,” Neural Annual International Conference on Machine Learning, ser. ICML ’09. Networks, vol. 5, no. 6, pp. 961 – 970, 1992. New York, NY, USA: ACM, 2009, pp. 137–144. [Online]. Available: [21] L. Wang, Z. Wang, and S. Liu, “An effective multivariate time series http://doi.acm.org/10.1145/1553374.1553392 classification approach using echo state network and adaptive differen- [41] J. Zhou, J. Chen, and J. Ye, “Clustered multi-task learning via tial evolution algorithm,” Expert Systems with Applications, vol. 43, pp. alternating structure optimization,” in Advances in Neural Information 237 – 249, 2016. Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, [22] S. Zhang, “Adaptive spectral estimation for nonstationary multivariate F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., time series,” Computational Statistics and Data Analysis, vol. 103, pp. 2011, pp. 702–710. [Online]. Available: http://papers.nips.cc/paper/ 330 – 349, 2016. 4292-clustered-multi-task-learning-via-alternating-structure-optimization. [23] R. Caruana, “Multitask learning,” Machine Learning, vol. 28, no. 1, pp. pdf 41–75, Jul. 1997. [42] Y. Zhang and D.-Y. Yeung, “Transfer metric learning by learning task [24] T. Evgeniou, C. A. Micchelli, and M. Pontil, “Learning multiple tasks relationships,” in Proceedings of the 16th ACM SIGKDD International with kernel methods,” Journal of Machine Learning Research, vol. 6, Conference on Knowledge Discovery and Data Mining, ser. KDD ’10. no. Apr, pp. 615–637, 2005. New York, NY, USA: ACM, 2010, pp. 1199–1208. [Online]. Available: [25] H. Zheng, X. Geng, D. Tao, and Z. Jin, “A multi-task model for simulta- http://doi.acm.org/10.1145/1835804.1835954 neous face identification and facial expression recognition,” Neurocom- [43] B. Bakker and T. Heskes, “Task clustering and gating for bayesian puting, vol. 171, pp. 515 – 523, 2016. multitask learning,” J. Mach. Learn. Res., vol. 4, pp. 83–99, Dec. 2003. [26] T. Zeng and S. Ji, “Deep convolutional neural networks for multi- [Online]. Available: http://dx.doi.org/10.1162/153244304322765658 instance multi-task learning,” in Data Mining (ICDM), 2015 IEEE In- [44] S. Zhong, J. Pu, Y.-G. Jiang, R. Feng, and X. Xue, “Flexible multi-task ternational Conference on, Nov 2015, pp. 579–588. learning with latent task grouping,” Neurocomputing, vol. 189, pp. [27] B. L. Happel and J. M. Murre, “Design and evolution of modular neural 179 – 188, 2016. [Online]. Available: http://www.sciencedirect.com/ network architectures,” Neural Networks, vol. 7, no. 67, pp. 985 – 1004, science/article/pii/S0925231216000035 1994, models of Neurodynamics and Behavior. [Online]. Available: [45] X. Tang, Q. Miao, Y. Quan, J. Tang, and K. Deng, “Predicting http://www.sciencedirect.com/science/article/pii/S0893608005801558 individual retweet behavior by user similarity: A multi-task learning [28] J. Clune, J.-B. Mouret, and H. Lipson, “The evolutionary origins of mod- approach,” Knowledge-Based Systems, vol. 89, pp. 681 – 688, 2015. ularity,” Proceedings of the Royal Society of London B: Biological Sci- [Online]. Available: http://www.sciencedirect.com/science/article/pii/ ences, vol. 280, no. 1755, 2013. S0950705115003470 [29] K. O. Ellefsen, J.-B. Mouret, and J. Clune, “Neural modularity helps [46] A. Liu, Y. Lu, W. Nie, Y. Su, and Z. Yang, “Hep-2 cells classification organisms evolve to learn new skills without forgetting old skills,” PLoS via clustered multi-task learning,” Neurocomputing, vol. 195, pp. Comput Biol, vol. 11, no. 4, pp. 1–24, 04 2015. 195 – 201, 2016, learning for Medical Imaging. [Online]. Available: [30] J. N. Tsitsiklis and B. V. Roy, “Neuro-dynamic programming overview http://www.sciencedirect.com/science/article/pii/S0925231216001235 and a case study in optimal stopping,” in Decision and Control, 1997., [47] X. Qin, X. Tan, and S. Chen, “Mixed bi-subject kinship verification Proceedings of the 36th IEEE Conference on, vol. 2, Dec 1997, pp. via multi-view multi-task learning,” Neurocomputing, pp. –, 2016. 1181–1186 vol.2. [Online]. Available: http://www.sciencedirect.com/science/article/pii/ [31] X. Fang, D. Zheng, H. He, and Z. Ni, “Data-driven heuristic dynamic S0925231216306658 programming with virtual reality,” Neurocomputing, vol. 166, pp. 244 – [48] S. Zhang, Y. Sui, S. Zhao, X. Yu, and L. Zhang, “Multi-local- 255, 2015. task learning with global regularization for object tracking,” Pattern [32] M. Potter and K. De Jong, “A cooperative coevolutionary approach to Recognition, vol. 48, no. 12, pp. 3881 – 3894, 2015. [Online]. Available: function optimization,” in Parallel Problem Solving from Nature PPSN http://www.sciencedirect.com/science/article/pii/S0031320315002265 III, ser. Lecture Notes in Computer Science, Y. Davidor, H.-P. Schwefel, [49] P. Angeline, G. Saunders, and J. Pollack, “An evolutionary algorithm and R. Mnner, Eds. Springer Berlin Heidelberg, 1994, vol. 866, pp. that constructs recurrent neural networks,” Neural Networks, IEEE 249–257. Transactions on, vol. 5, no. 1, pp. 54 –65, jan 1994. [33] M. A. Potter and K. A. De Jong, “Cooperative coevolution: An architec- [50] D. E. Moriarty and R. Miikkulainen, “Forming neural networks ture for evolving coadapted subcomponents,” Evol. Comput., vol. 8, pp. through efficient and adaptive coevolution,” , 1–29, 2000. vol. 5, no. 4, pp. 373–399, 1997. [Online]. Available: http: [34] A. Gupta, Y. S. Ong, and L. Feng, “Multifactorial evolution: To- //www.mitpressjournals.org/doi/abs/10.1162/evco.1997.5.4.373 ward evolutionary multitasking,” IEEE Trans. Evolutionary Computa- [51] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through tion, vol. 20, no. 3, pp. 343–357, 2016. augmenting topologies,” Evolutionary Computation, vol. 10, no. 2, pp. [35] Y. S. Ong and A. Gupta, “Evolutionary multitasking: A computer sci- 99–127, 2002. ence view of cognitive multitasking,” Cognitive Computation, vol. 8, [52] F. Gomez, J. Schmidhuber, and R. Miikkulainen, “Accelerated neural no. 2, pp. 125–142, 2016. evolution through cooperatively coevolved synapses,” J. Mach. Learn. [36] R. Chandra, A. Gupta, Y. S. Ong, and C. K. Goh, “Evolutionary Res., vol. 9, pp. 937–965, 2008. multi-task learning for modular knowledge representation in neural [53] V. Heidrich-Meisner and C. Igel, “Neuroevolution strategies for networks,” Neural Processing Letters, 2018. [Online]. Available: episodic reinforcement learning,” Journal of Algorithms, vol. 64, https://doi.org/10.1007/s11063-017-9718-z no. 4, pp. 152 – 168, 2009, special Issue: Reinforcement [37] R. Chandra, “Dynamic cyclone wind-intensity prediction using co- Learning. [Online]. Available: http://www.sciencedirect.com/science/ evolutionary multi-task learning,” in Neural Information Processing, article/B6WH3-4W7RY8J-3/2/22f7075bc25dab10a8ff3714e2fee303 D. Liu, S. Xie, Y. Li, D. Zhao, and E.-S. M. El-Alfy, Eds., 2017, pp. [54] N. Garca-Pedrajas, C. Hervas-Martinez, and J. Munoz-Perez, “Multi- 618–627. objective cooperative coevolution of artificial neural networks (multi- [38] R. K. Ando and T. Zhang, “A framework for learning predictive objective cooperative networks),” Neural Networks, vol. 15, pp. 1259– structures from multiple tasks and unlabeled data,” J. Mach. Learn. 1278, 2002. Res., vol. 6, pp. 1817–1853, Dec. 2005. [Online]. Available: [55] R. Chandra, M. Frean, and M. Zhang, “On the issue of separability for http://dl.acm.org/citation.cfm?id=1046920.1194905 problem decomposition in cooperative neuro-evolution,” Neurocomput- [39] L. Jacob, J. philippe Vert, and F. R. Bach, “Clustered multi- ing, vol. 87, pp. 33–40, 2012. task learning: A convex formulation,” in Advances in Neural [56] ——, “An encoding scheme for cooperative coevolutionary neural net- Information Processing Systems 21, D. Koller, D. Schuurmans, works,” in 23rd Australian Joint Conference on Artificial Intelligence, Y. Bengio, and L. Bottou, Eds. Curran Associates, Inc., ser. Lecture Notes in Artificial Intelligence. Adelaide, Australia:

15 Springer-Verlag, 2010, pp. 253–262. Joint Conference on Neural Networks (IJCNN), Killarney, Ireland, July [57] R. Chandra, M. Frean, M. Zhang, and C. W. Omlin, “Encoding sub- 2015, pp. 721–728. components in cooperative co-evolutionary recurrent neural networks,” [80] M. Chayama and Y. Hirata, “When univariate model-free time Neurocomputing, vol. 74, no. 17, pp. 3223 – 3234, 2011. series prediction is better than multivariate,” Physics Letters A, vol. [58] F. Gomez and R. Mikkulainen, “Incremental evolution of complex gen- 380, no. 3132, pp. 2359 – 2365, 2016. [Online]. Available: eral behavior,” Adapt. Behav., vol. 5, no. 3-4, pp. 317–342, 1997. http://www.sciencedirect.com/science/article/pii/S0375960116302195 [59] F. J. Gomez, “Robust non-linear control through neuroevolution,” PhD [81] X. Wu, Y. Wang, J. Mao, Z. Du, and C. Li, “Multi-step prediction Thesis, Department of Computer Science, The University of Texas at of time series with random missing data,” Applied Mathematical Austin, Technical Report AI-TR-03-303, 2003. Modelling, vol. 38, no. 14, pp. 3512 – 3522, 2014. [Online]. Available: [60] R. A. Howard, “Dynamic programming,” Management Science, vol. 12, http://www.sciencedirect.com/science/article/pii/S0307904X13007658 no. 5, pp. 317–348, 1966. [82] R. L. Smith, “Extreme value analysis of environmental time series: an [61] M. Held and R. M. Karp, “A dynamic programming approach to se- application to trend detection in ground-level ozone,” Statistical Science, quencing problems,” Journal of the Society for Industrial and Applied pp. 367–377, 1989. Mathematics, vol. 10, no. 1, pp. 196–210, 1962. [83] P. D’Urso, E. A. Maharaj, and A. M. Alonso, “Fuzzy clustering of time [62] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization series using extremes,” Fuzzy Sets and Systems, vol. 318, pp. 56 – 79, for spoken word recognition,” IEEE transactions on acoustics, speech, 2017, theme: Clustering and Image Processing. and signal processing, vol. 26, no. 1, pp. 43–49, 1978. [84] N. Chouikhi, B. Ammar, N. Rokbani, and A. M. Alimi, “Pso- [63] A. A. Amini, T. E. Weymouth, and R. C. Jain, “Using dynamic program- based analysis of echo state network parameters for time series ming for solving variational problems in vision,” IEEE Transactions on forecasting,” Applied Soft Computing, vol. 55, pp. 211 – 225, 2017. pattern analysis and machine intelligence, vol. 12, no. 9, pp. 855–867, [Online]. Available: http://www.sciencedirect.com/science/article/pii/ 1990. S1568494617300649 [64] R. S. Sutton, “Introduction: The challenge of reinforcement learning,” [85] C. Frazier and K. Kockelman, “Chaos theory and transportation systems: in Reinforcement Learning. Springer, 1992, pp. 1–3. Instructive example,” Transportation Research Record: Journal of the [65] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learn- Transportation Research Board, vol. 20, pp. 9–17, 2004. ing: A survey,” Journal of artificial intelligence research, vol. 4, pp. [86] R. A. Calvo, H. D. Navone, and H. A. Ceccatto, Neural Network Analy- 237–285, 1996. sis of Time Series: Applications to Climatic Data. Berlin, Heidelberg: [66] D. P. Bertsekas, D. P. Bertsekas, D. P. Bertsekas, and D. P. Bertsekas, Springer Berlin Heidelberg, 2000, pp. 7–16. Dynamic programming and optimal control. Athena scientific Belmont, [87] Y. Wang, W. Zhang, and W. Fu, “Back propogation(bp)-neural network MA, 1995, vol. 1, no. 2. for tropical cyclone track forecast,” in Geoinformatics, 2011 19th Inter- [67] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic programming: an national Conference on, June 2011, pp. 1–4. overview,” in Decision and Control, 1995., Proceedings of the 34th [88] (2015) JTWC tropical cyclone best track data site. IEEE Conference on, vol. 1. IEEE, 1995, pp. 560–564. [89] M. Mackey and L. Glass, “Oscillation and chaos in physiological control [68] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. systems,” Science, vol. 197, no. 4300, pp. 287–289, 1977. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski [90] E. Lorenz, “Deterministic non-periodic flows,” Journal of Atmospheric et al., “Human-level control through deep reinforcement learning,” Na- Science, vol. 20, pp. 267 – 285, 1963. ture, vol. 518, no. 7540, p. 529, 2015. [91]M.H enon,´ “A two-dimensional mapping with a strange attractor,” Com- [69] D. E. Moriarty, A. C. Schultz, and J. J. Grefenstette, “Evolutionary al- munications in Mathematical Physics, vol. 50, no. 1, pp. 69–77, 1976. gorithms for reinforcement learning,” Journal of Artifical Intelligence [92] O. Rssler, “An equation for continuous chaos,” Physics Letters A, Research JAIR, vol. 11, pp. 241 – 276, 1999. vol. 57, no. 5, pp. 397 – 398, 1976. [70] N. Richards, D. E. Moriarty, and R. Miikkulainen, “Evolving neural net- [93] S. S., “Solar cycle forecasting: A nonlinear dynamics approach,” As- works to play go,” Applied Intelligence, vol. 8, no. 1, pp. 85–96, 1998. tronomy and Astrophysics, vol. 377, pp. 312–320, 2001. [71] K. P. Bennett and E. Parrado-Hernandez,´ “The interplay of optimization [94] “NASDAQ Exchange Daily: 1970-2010 Open, Close, High, Low and machine learning research,” Journal of Machine Learning Research, and Volume,” accessed: 02-02-2015. [Online]. Available: http: vol. 7, no. Jul, pp. 1265–1281, 2006. //www.nasdaq.com/symbol/aciw/stock-chart [72] A. Guillory, E. Chastain, and J. Bilmes, “Active learning as non-convex [95] “Sante Fe Competition Data.” [Online]. Avail- optimization,” in Artificial Intelligence and Statistics, 2009, pp. 201– able: http://www-psych.stanford.edu/∼andreas/Time-Series/SantaFe. 208. html,note={Accessed:30-06-2016} [73] T. Koskela, M. Lehtokangas, J. Saarinen, and K. Kaski, “Time series pre- [96] N. Hansen, S. D. Muller,¨ and P. Koumoutsakos, “Reducing diction with multilayer perceptron, FIR and Elman neural networks,” in the time complexity of the derandomized with In Proceedings of the World Congress on Neural Networks, San Diego, covariance matrix adaptation (CMA-ES),” Evolutionary Computation, CA, USA, 1996, pp. 491–496. vol. 11, no. 1, pp. 1–18, 2003. [Online]. Available: http: [74] H. Sandya, P. Hemanth Kumar, and S. B. Patil, “Feature extraction, clas- //dx.doi.org/10.1162/106365603321828970 sification and forecasting of time series signal using fuzzy and garch [97] R. Chandra, “Competition and collaboration in cooperative coevolution techniques,” in Research & Technology in the Coming Decades (CRT of elman recurrent neural networks for time-series prediction,” IEEE 2013), National Conference on Challenges in. IET, 2013, pp. 1–7. Trans. Neural Netw. Learning Syst., vol. 26, no. 12, pp. 3123–3136, [75] L. Zhang, W.-D. Zhou, P.-C. Chang, J.-W. Yang, and F.-Z. Li, “Iterated 2015. [Online]. Available: http://dx.doi.org/10.1109/TNNLS.2015. time series prediction with multiple support vector regression models,” 2404823 Neurocomputing, vol. 99, pp. 411–422, 2013. [98] J. Misra and I. Saha, “Artificial neural networks in hardware: A survey [76] S. Ben Taieb and R. Hyndman, “Recursive and direct multi-step fore- of two decades of progress,” Neurocomputing, vol. 74, no. 13, pp. 239 – casting: the best of both worlds,” Monash University, Department of 255, 2010, artificial Brains. Econometrics and Business Statistics, Tech. Rep., 2012. [99] X. Chen, Y. S. Ong, M. H. Lim, and K. C. Tan, “A multi-facet survey [77] A. Grigorievskiy, Y. Miche, A.-M. Ventel, E. Sverin, and A. Lendasse, on memetic computation,” IEEE Transactions on Evolutionary Compu- “Long-term time series prediction using op-elm,” Neural Networks, tation, vol. 15, no. 5, pp. 591–607, 2011. vol. 51, pp. 50 – 56, 2014. [100] A. E. Eiben, E. H. L. Aarts, and K. M. v. Hee, “Global convergence of [78] Y. Yin and P. Shang, “Forecasting traffic time series with multivariate genetic algorithms: A markov chain analysis,” in Proceedings of the 1st predicting method,” Applied Mathematics and Computation, vol. 291, Workshop on Parallel Problem Solving from Nature, ser. PPSN, 1991, pp. 266 – 278, 2016. [Online]. Available: http://www.sciencedirect. pp. 4–12. com/science/article/pii/S0096300316304477 [79] R. Chandra, K. Dayal, and N. Rollings, “Application of cooperative neuro-evolution of Elman recurrent networks for a two-dimensional cy- clone track prediction for the South Pacific region,” in International

16