Brain Emotional Learning-Based Prediction Model For Long-Term Chaotic Prediction Applications

M. Parsapoor

Abstract—This study suggests a new prediction model for chaotic time series inspired by the brain emotional learning of mammals. We describe the structure and function of this model, which is referred to as BELPM (Brain Emotional Learning-Based Prediction Model). Structurally, the model mimics the connections between the regions of the limbic system, and functionally it uses weighted k-nearest neighbors to imitate the roles of those regions. The learning algorithm of BELPM is defined using the steepest descent (SD) method and the least square estimator (LSE). Two benchmark chaotic time series, Lorenz and Henon, have been used to evaluate the performance of BELPM. The obtained results have been compared with those of other prediction methods. The results show that BELPM has the capability to achieve reasonable accuracy for long-term prediction of chaotic time series, using a limited amount of training data and a reasonably low computational time.

Index Terms—Brain emotional learning, Chaotic time series, Long-term prediction, Weighted k-nearest-neighbors.

I. INTRODUCTION

Prediction models have applications in many different areas of science and technology: business, economics, healthcare, and welfare services. Well-known data-driven methodologies such as neural networks and neuro-fuzzy models have shown reasonable accuracy in the nonlinear prediction of chaotic time series [1]. According to the Vapnik-Chervonenkis (VC) theory [2], a sufficient size of training samples for achieving arbitrary prediction accuracy is proportional to the number of the model's learning parameters. Thus, a data-driven model with high model complexity and a high number of learning parameters requires a large number of training samples to achieve high accuracy in a prediction application. The accuracy of predicting chaotic time series depends on the characteristics of the applied model (i.e., the model complexity and the number of learning parameters) as well as the features of the prediction application (i.e., the number of training samples, chaos degree, embedding dimension, and the horizon of prediction) [1]-[5].

This study suggests a prediction model that could be feasible for long-term prediction of chaotic systems (chaotic time series) with a limited amount of training data samples; i.e., a prediction model with a small number of learning parameters. The suggested model is inspired by the brain emotional processing of mammals and is therefore named BELPM, which stands for Brain Emotional Learning-Based Prediction Model. Its simple architecture has been inspired by the brain emotional system, emphasizing the interaction between those parts that have significant roles in emotional learning. The model mimics emotional learning by merging the weighted k-Nearest Neighbor (Wk-NN) method and adaptive neural networks. The learning algorithm of BELPM is based on the steepest descent (SD) method and the least square estimator (LSE). The model aims at continuing the recent studies that have suggested computational models of emotional processing for control and prediction applications [6]-[16].

The rest of the paper is organized as follows: Section II gives an introduction to emotional processing in the brain, reviews the anatomical aspects of emotional learning, and gives an overview of related work in the area of computational models of brain emotional learning. Section III describes the BELPM's architecture and illustrates its learning algorithm. In Section IV, two benchmark chaotic time series, Lorenz and Henon, are used to evaluate the performance of BELPM, and the results are compared with the results obtained from other nonlinear learning methods (e.g., the Adaptive-Network-Based Fuzzy Inference System (ANFIS) [3], the MultiLayer Perceptron (MLP) network [1], [2], Radial Basis Function (RBF) networks [1], [2], and Local Linear Neuro-Fuzzy (LLNF) models [1]). Finally, conclusions about the BELPM model and further improvements to the model are stated in Section V.

II. BACKGROUND

One of the most challenging topics in the machine learning research area is the development of models with high generalization capability for accurate prediction of chaotic systems. Recently, bio-inspired models, in particular emotion-based learning models [6], [8], [10], [12]-[14], have shown acceptable generalization capability in modeling and predicting the chaotic behavior of dynamic systems. In fact, this capability is obtained in emotion-based learning models by integrating machine learning algorithms with a computational model of emotional learning. In the following, we explain how emotional learning can be modeled as a computer-based tool and how it can be integrated with learning algorithms. The differences between BELPM and several well-known data-driven methods will also be indicated.

A. Related Works in Modeling Emotional Learning

Since 1988, emotion and emotional processing have been active research topics for neuroscientists and psychologists. A lot of effort has been made to analyze emotional behavior and describe emotion on the basis of different hypotheses, e.g., psychological, neurobiological, philosophical, and learning hypotheses. These hypotheses have contributed to the present computer-based models of emotional processing [18], which imitate certain aspects of emotional learning and can be classified on the basis of their fundamental theories and applications. For example, a computer-based model that is based on the central theory [18] (i.e., which explains how a primary evaluation of emotional stimuli forms emotional experiences) is called a computational model of emotional learning and imitates the associative learning aspect of emotional processing [18] that is based on fear conditioning [19], [20].

Fig. 1. The parts of the amygdala and their pathways; the diagram shows the pattern of fear conditioning [17]. (The figure depicts the emotional stimulus entering through the thalamus and the lateral part of the amygdala, the basolateral part with its lateral, basal, and accessory basal subparts, the centromedial part producing the emotional response, and the connection to the orbitofrontal cortex.)

Emotional processing has also been described using different anatomical structures: "MacLean's limbic system, Cannon's structure, and Papez circuit" [21]. The first anatomical representation is based on studies in cognitive neuroscience [21], [22], and has been developed on the basis of Cannon's structure and the Papez circuit, emphasizing the role of the limbic system (i.e., a group of brain regions) in emotional processing. Cannon's structure suggested that the hypothalamus of the brain plays the most significant role in emotional learning, while the Papez circuit [21] emphasized the role of the cingulate cortex in emotional processing. In this study, we have focused on the limbic system, which is the basis of our suggested model.

1) Anatomical Structure of Emotional Learning

The limbic system is a group of brain regions, which includes the hippocampus, amygdala, thalamus, and sensory cortex [22]-[24]. The roles of the main regions of the limbic system with regard to emotional learning can be summarized as follows:

a) Thalamus receives emotional stimuli and is responsible for the provision of high-level information, i.e., determining the effective values of stimuli [22]-[28]. It then passes the generated signals to the amygdala and sensory cortex [28]. The thalamus includes different parts that process the emotional stimuli separately [27].

b) Sensory cortex is a part of the sensory area of the brain and is responsible for analysis and processing of the received signals. The sensory cortex distributes its output signals between the amygdala and the orbitofrontal region [18]-[21], [27].

c) Amygdala is the central part of the limbic system of mammals and has a principal role in emotional learning [18]-[26]. The amygdala consists of several parts with different functional roles (see Fig. 1), and it connects through them to other regions of the brain (e.g., the insular cortex, orbital cortex, and frontal lobe). It has connections to the thalamus, orbitofrontal cortex, and hypothalamus [25], [26]. During emotional learning, the amygdala participates in reacting to emotional stimuli, storing emotional responses [29], evaluating positive and negative reinforcement [30], learning the association between unconditioned and conditioned stimuli [19], [20], [31], predicting the association between stimuli and future reinforcement [31], and forming an association between neutral stimuli and emotionally charged stimuli [30].

The two main parts of the amygdala are the basolateral part (the largest portion of the amygdala) and the centromedial part. The basolateral part has a bidirectional link to the insular cortex and orbital cortex [18], [20], [21], [25], [26], performs the main role in mediating memory consolidation [32] and providing the primary response, and is divided into three parts: the lateral, basal, and accessory basal [25], [29]. The lateral part is the part through which stimuli enter the amygdala. The lateral region not only passes the stimuli to other regions, but also memorizes them to form the stimulus–response association [31]. This part also takes roles in spreading the sensory information to other parts, forming the association between the conditioned and unconditioned stimuli, inhibiting and reflecting the external stimuli, and memorizing the emotional experiences. The basal and accessory basal parts participate in mediating contextual conditioning [20], [25]. The centromedial part, which is the main output of the basolateral part [26], is divided into the central and medial parts [18], [20], [25], [26]. It is responsible for the hormonal aspects of emotional reactions [25] and for mediating the expression of the emotional responses [25], [26].

d) Orbitofrontal cortex is located close to the amygdala and has a bidirectional connection to the amygdala. This part plays roles in processing stimuli [25], decoding the primary reinforcement, representing the negative reinforcement, and learning the stimulus–reinforcement association. It also evaluates and corrects reward and punishment [18]-[21], [33]-[37], selects goals, makes decisions for a quick response to punishment [18], [23], [25]-[36], and prevents inappropriate responses of the amygdala. The orbitofrontal cortex encompasses two parts, the medial and the lateral. The medial part forms and memorizes the reinforcement–stimulus association and also has a role in providing responses and monitoring them, whereas the lateral part evaluates the response and provides punishment [37].

2) Emotion-Inspired Computational Models

Computational models of emotional learning, which are computer-based models, have been developed to represent the associative learning aspect of emotional processing. From the application perspective, they can be categorized into three groups: emotion-based decision-making models, emotion-based controllers, and emotion-based machine-learning approaches.

a) Emotion-based decision-making model: This model is the basis of artificial intelligence (AI) emotional agents that integrate emotional reactions with rational reactions. EMAI (Emotionally Motivated Artificial Intelligence) was one of the first attempts to develop emotion-based agents. It was applied to simulating artificial soccer playing [38], and its results were fairly good. The Cathexis model [39] was another emotional agent that reacted to an environment by imitating the emotional decision-making process in humans. The model of the mind [40] was developed as a modular artificial agent to generate emotional behavior for making decisions. An agent architecture called Emotion-based Robotic Agent Development (in reverse order, DARE) was developed on the basis of the somatic marker theory; it was tested in a multi-agent system and showed ability in modeling social and emotional behavior [41].

b) Emotion-based controller: The first practical implementation of an emotion-based controller is BELBIC (Brain Emotional Learning-Based Intelligent Controller) [7]. It was developed on the basis of Moren and Balkenius' computational model [7], [23], [42]. The BELBIC has been successfully employed for a number of applications: controlling heating and air conditioning [43], aerospace launch vehicles [44], intelligent washing machines [45], and trajectory tracking of stepper motors [46]. Another emotion-based intelligent controller is a neuro-fuzzy controller [47], which was integrated with emotion-based performance measurement to tune the parameters of the controller. The application of an emotion-based controller to robotics was proposed in [48], which is an interesting example of applying emotional concepts in robotic applications and imitated the reinforcement learning aspect of emotional processing. The results of applying emotion-based controllers have shown that they have the capability to overcome uncertainty and complexity issues of control applications. Specifically, the BELBIC has been proven to outperform others in terms of simplicity, reliability, and stability [7], [39]-[42].

c) Emotion-based machine-learning approach: Developing machine-learning approaches by imitating emotional processing of the brain has captured the attention of researchers in the AI area. So far, some studies have been carried out to develop new neural networks by imitating some aspects of emotional learning. Hippocampus-neocortex and amygdala-hippocampus models have been proposed as neural network models [49], [50]. They combine associative neural networks with emotional learning concepts. Several emotion-based prediction models [6], [8]-[15] have been developed to model complex systems. Most of them are based on the amygdala-orbitofrontal subsystem [23] that was proposed by Moren and Balkenius. They have been applied to different applications, e.g., auroral electrojet (AE) index prediction [9], solar activity prediction [6], [8], [10], [11]-[13], [14], and financial and chaotic time series prediction [6], [8]-[14].

d) Amygdala-orbitofrontal system: The fundamental model of emotion-based machine-learning approaches and emotion-based controllers is the amygdala-orbitofrontal system. It is a type of computational model of emotional learning with a simple structure that has been defined using the theory of the limbic system [23]. Its structure is inherited from some parts of the limbic system (e.g., the amygdala, thalamus, and sensory cortex), imitates the interaction between those parts of the limbic system, and formulates the emotional response using mathematical equations [23]. The amygdala-orbitofrontal system consists of two subsystems: the amygdala subsystem and the orbitofrontal subsystem. Each subsystem has several linear neurons and receives a reinforcing signal (a reward). The model's output function has been defined as subtracting the orbitofrontal's response from the amygdala's response. To update the weights, learning rules are defined for both the amygdala and the orbitofrontal subsystem. Due to its simplicity, it has been the basis of most controllers and prediction models inspired by emotional learning.

B. A Brief Overview of the Data-Driven Methods

One straightforward way to evaluate data-driven learning approaches (e.g., neural network and neuro-fuzzy models) is to apply them to predict chaotic time series. Different types of neural network and neuro-fuzzy models (e.g., RBF, ANFIS, LLNF) have been applied to model and predict the short-term and long-term behavior of chaotic time series (e.g., Lorenz, Henon, Mackey-Glass, Ikeda) [1]-[15], [51]-[55]. We have suggested a model that differs from the previously proposed models in terms of prediction accuracy, structural simplicity, and generalization capability. In the following, we explain the differences between BELPM and other well-known data-driven models.

1) Radial Basis Function (RBF) networks differ from BELPM in terms of the underlying structure, the inputs of the neurons, the connections between neurons, and the number of learning parameters and learning algorithms.

2) The Generalized Regression Neural Network (GRNN) [1] differs from BELPM in its number of neurons (i.e., the number of neurons of GRNN is equal to the size of the training samples). Moreover, GRNN has no learning algorithm to optimize its performance and increase its generalization capability.

3) The Adaptive Neuro-Fuzzy Inference System (ANFIS) and BELPM are not similar because of their different structures, functions, and some aspects of their learning algorithms. Due to its learning algorithm and the large number of learning parameters (linear and nonlinear) that are spread through its layers, ANFIS has the capability to obtain very accurate results for complex applications. However, its learning algorithm has a significant effect on its computational complexity and it also causes over-fitting problems. The curse of dimensionality is another issue of ANFIS and increases its computational time for high-dimension applications. The number of learning parameters of BELPM is not dependent on the dimension of the input data; as mentioned before, BELPM uses Wk-NN, and consequently the computational time of BELPM depends only on the number of neighbors. To decrease its time complexity in high-dimension cases, we can choose a small number of neighbors for BELPM.

4) Local Linear Neuro-Fuzzy models (LLNF) and BELPM can both be considered as types of "local modeling" [2] algorithms. They both combine an optimization-learning algorithm and LSE to train the learning parameters. However, LLNF uses the Local Linear Model Tree (LoLiMoT) algorithm instead of the Wk-NN method of BELPM. The number of learning parameters of LoLiMoT has a linear relationship with the dimension of the input samples and the number of epochs; thus, its computational complexity has no exponential growth for high-dimension applications.

5) A modular neural network is a combination of several modules with different inputs [2] and without any connection between them. There is no algorithm to update the learning parameters of the modules.

6) Hybrid structures that are defined in [1] differ from BELPM in how they receive the input data. The submodules of a hybrid structure can also be designed in parallel or in series.

III. BRAIN EMOTIONAL LEARNING-BASED PREDICTION MODEL

The architecture of BELPM is inspired by the brain emotional learning system. It mimics some functional and structural aspects of the limbic system regions. To describe BELPM's architecture and its underlying learning algorithms, we use machine-learning terminology instead of neuro-scientific terms. For example, an input–output pair of training data and an input–output pair of test data are equivalent to an unconditioned stimulus–response pair and a conditioned stimulus–response pair in neuro-scientific terminology, respectively. Thus, we use the two subscripts u and c to distinguish the training data set and the test data set, which are defined as I_u = {i_{u,j}, r_{u,j}}_{j=1}^{N_u} and I_c = {i_{c,j}, r_{c,j}}_{j=1}^{N_c}, respectively, with N_u and N_c data samples. Before explaining the architecture of BELPM, let us briefly review the Wk-NN algorithm that is the basis of BELPM. The following steps explain how the output value of an input vector i_test is calculated by using the Wk-NN algorithm [56]:

1) Calculate the Euclidean distance d_j = || i_test - i_{u,j} ||_2 for each i_{u,j} that is a member of the training data set {i_{u,1}, i_{u,2}, ..., i_{u,N_u}}.

2) Determine the k minimum values of d = {d_1, d_2, ..., d_{N_u}} as d_min. The data samples corresponding to d_min are shown as I_min = {i_{min,j}, r_{min,j}}_{j=1}^{k}, where I_min denotes the samples of the training data set that are the nearest neighbors to the test sample i_test.

3) Calculate the output of i_test as given in (1):

r_{test} = \left( \sum_{j=1}^{k} w_j\, r_{min,j} \right) \Big/ \sum_{j=1}^{k} w_j    (1)

The weight w_j is calculated as w_j = K(d_j), where K(.) is known as the kernel function that makes the transition from Euclidean distances to weights. Any arbitrary function that holds the following properties can be considered as a kernel function [56], [57]:

1) For all d, K(d) >= 0.
2) If d -> 0, then K(d) takes its maximum value.
3) If d -> infinity, then K(d) takes its minimum value.

Some typical kernel functions are the Gaussian kernel (2), the inversion kernel (3), and the weighted kernel function defined as (4):

K(d) = \exp\!\left( -\frac{1}{2}\,\frac{d^2}{\sigma^2} \right)    (2)

K(d) = \frac{1}{d}    (3)

K(d_j) = \frac{\max(d) - (d_j - \min(d))}{\max(d)}    (4)

As mentioned earlier, these functions transform the Euclidean distances into weights; thus, the neighbors that are closer to the test sample have higher weights in estimating the output r_test [56], [57].
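To make the Wk-NN review concrete, the following minimal Python sketch (our illustration only, not the authors' implementation; all function and variable names are ours) carries out steps 1)-3) with the Gaussian kernel of (2):

    import numpy as np

    def gaussian_kernel(d, sigma=1.0):
        # Eq. (2): larger distances receive smaller weights.
        return np.exp(-0.5 * (d / sigma) ** 2)

    def wknn_predict(x_test, X_train, r_train, k=5, kernel=gaussian_kernel):
        # Step 1: Euclidean distances to every training sample.
        d = np.linalg.norm(X_train - x_test, axis=1)
        # Step 2: indices of the k nearest neighbours.
        idx = np.argsort(d)[:k]
        # Step 3: kernel-weighted average of the neighbours' targets, Eq. (1).
        w = kernel(d[idx])
        return np.sum(w * r_train[idx]) / np.sum(w)

    # Toy usage: learn y = sum(x) from random samples.
    rng = np.random.default_rng(0)
    X_train = rng.uniform(-1, 1, size=(200, 3))
    r_train = X_train.sum(axis=1)
    print(wknn_predict(np.array([0.1, -0.2, 0.3]), X_train, r_train, k=5))

Any of the kernels (2)-(4) could be substituted for the Gaussian one, as long as it respects the three properties listed above.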
A. Architecture of BELPM

As shown in Fig. 2, the BELPM's architecture consists of four main parts that are named TH, CX, AMYG, and ORBI, and are referred to as the THalamus, sensory CorteX, AMYGdala, and ORBItofrontal parts, respectively.

Fig. 2. The architecture of BELPM showing the structure of each part (TH with its MAX_MIN and AGG components, CX, AMYG with its BL and CM components, and ORBI with its MO and LO components) and its connections to the other parts. (a) An input from the training set, the unconditioned stimulus, enters the BELPM during the first learning phase. (b) An input from the test data, the conditioned stimulus, enters the BELPM during the second learning phase.

Let us assume an unseen input i_{c,j} in I_c enters BELPM, which provides the corresponding output using the following steps:

1) The input vector i_{c,j} is fed to TH, which is the entrance part of the BELPM structure. This part consists of two components: MAX_MIN (MAXimum_MINimum) and AGG (AGGregation). The MAX_MIN can be described as a modular neural network. It has two neural networks, each of which has two layers with a competitive function for the neuron of the first layer and a linear function for the neuron of the second layer. The output of MAX_MIN, which is referred to as th^{MAX_MIN}_{c,j}, is calculated according to (5) and is fed to AGG and AMYG. Equation (5) calculates the highest and lowest values of the input vector with R dimensions. The AGG can be described as a neural network with R+2 linear neurons (R is the dimension of i_{c,j}); the output of AGG, th^{AGG}_{c,j}, is equal to the input vector concatenated with th^{MAX_MIN}_{c,j} and is fed to CX as shown in (6).

th^{MAX\_MIN}_{c,j} = [\, \mathrm{Max}(i_{c,j}),\ \mathrm{Min}(i_{c,j}) \,]    (5)

th^{AGG}_{c,j} = [\, i_{c,j},\ th^{MAX\_MIN}_{c,j} \,]    (6)

2) The th^{AGG}_{c,j} is sent to CX, which is a pre-trained neural network with one layer of linear functions. The role of CX is to provide s_{c,j} and distribute it between AMYG and ORBI. It should be noted that s_{c,j} and th^{AGG}_{c,j} have the same entity; however, they originate from different parts.
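Under our reading of (5) and (6), TH simply augments the input with its extreme values before CX distributes it; a minimal sketch (the identity stand-in for the pre-trained CX layer is an assumption, and the names are ours):

    import numpy as np

    def th_max_min(i_c):
        # Eq. (5): highest and lowest entries of the R-dimensional input.
        return np.array([np.max(i_c), np.min(i_c)])

    def th_agg(i_c):
        # Eq. (6): the input concatenated with its MAX_MIN summary (R+2 values).
        return np.concatenate([i_c, th_max_min(i_c)])

    def cx(th_agg_vec):
        # CX is described as a pre-trained one-layer linear network that simply
        # distributes th_agg between AMYG and ORBI; an identity map is the
        # simplest stand-in for that behaviour.
        return th_agg_vec

    i_c = np.array([0.3, -1.2, 0.8])
    s_c = cx(th_agg(i_c))
    print(th_max_min(i_c), s_c)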

3) Both s_{c,j} and th^{MAX_MIN}_{c,j} are sent to AMYG, which is the main part of the BELPM structure and is divided into two components: BL (BasoLateral) and CM (CenteroMedial). The subpart referred to as BL corresponds to the set of the lateral and basal parts, while the other subpart, CM, corresponds to the accessory basal and centromedial parts. There is a bidirectional connection between AMYG and ORBI; this connection is utilized to exchange information and carries AMYG's expected punishment and ORBI's response. The functionalities of AMYG have been defined to mimic some roles of the amygdala (e.g., storing unconditioned stimulus–response pairs, making the association between the conditioned and unconditioned stimuli, and generating reward and punishment). Thus, AMYG has the main role in providing the primary and final responses. The structure and function of BL and CM are given as follows:

a) BL, which is responsible for the provision of the primary response of the AMYG, calculates, for each member i_{u,i} of the training data set {i_{u,1}, i_{u,2}, ..., i_{u,N_u}}, the distance d_{a,i} = || s_{u,i} - s_{c,j} ||_2 + || th_{u,i} - th_{c,j} ||_2. Here, th_{u,i} and s_{u,i} are the outputs of TH and CX, respectively, for each member of the training data set; it must be noted that s_{c,j} and s_{u,i} are the outputs of the CX for i_{c,j} and i_{u,i}, respectively. The BL encompasses an adaptive network with four layers (see Fig. 3(a)). The first layer consists of k_a nodes ("adaptive or square" [3] nodes) with the kernel function K(.). Each node has an input that is an entry of d_{a min} = {d_{a min,1}, d_{a min,2}, ..., d_{a min,k_a}} (the set of the k_a minimum distances of d_a = {d_{a,1}, d_{a,2}, ..., d_{a,N_u}}). The output vector of the first layer is n^1_a; the output value of each node of this layer is calculated using (7), where the input to the m-th node is d_{a min,m}.

n^1_{a,m} = K( d_{a\,min,m} )    (7)

In general, the kernel function for the m-th node can be defined as (8), (9), or (10). The input and the parameter of K(.) are determined by d_m and b_m, which are the entries of d and b, respectively. We use the subscript a to distinguish BL's kernel function and its related parameters (d_{a min} and b_a).

K(d_m) = \exp( -d_m b_m )    (8)

K(d_m) = \frac{1}{1 + (d_m b_m)^{2z}}    (9)

K(d_m) = \frac{\max(d) - (d_m - \min(d))}{\max(d)}    (10)

The second layer is a normalization layer and has k_a nodes (fixed or "circle" nodes), which calculate the normalized value of n^1_{a,m} as in (11).

n^2_{a,m} = \frac{n^1_{a,m}}{\sum_{m=1}^{k_a} n^1_{a,m}}    (11)

The third layer has circle nodes with the function given in (12). This layer has two input vectors, n^1_a and r_{ua}; the latter is a vector that is extracted from r_u = {r_{u,1}, r_{u,2}, ..., r_{u,N_u}} and is related to the nearest neighbors.

n^3_{a,m} = \frac{n^1_{a,m}}{\sum_{m=1}^{k_a} n^1_{a,m}}\, r_{ua,m}    (12)

The fourth layer has a single node (circle) that calculates the summation of its input vector to produce r_{a,j}, the primary output (response). The function of the third and fourth layers can be formulated according to (13), i.e., the inner product of the normalized kernel outputs, n^2_a, and r_{ua}.

r_{a,j} = \sum_{m=1}^{k_a} n^3_{a,m} = n^2_a \cdot r_{ua}    (13)

The provided primary response, r_{a,j}, is sent to the CM to calculate the reinforcement signal. It should be noted that the connection between CM and BL (see Fig. 2(a)) does not exist between the centromedial nuclei and the other parts of the amygdala; however, we assume its existence in the BELPM's architecture.

b) CM is responsible for providing the final output; thus, it is an important component of AMYG. It has inputs from BL and ORBI, and performs different functions during the first learning phase and the second learning phase of the BELPM.
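The BL computation of (7) and (11)-(13) amounts to a normalized kernel-weighted average of the stored training responses over the k_a nearest neighbors; a short sketch under that reading (the function names, the choice of the exponential kernel (8), and the default parameters are ours):

    import numpy as np

    def kernel(d, b=1.0):
        # Eq. (8): exponential kernel with parameter b.
        return np.exp(-d * b)

    def bl_primary_response(d_a, r_u, k_a=5, b_a=1.0):
        # d_a: distances from the test sample to every stored training sample,
        # r_u: the stored training responses.
        idx = np.argsort(d_a)[:k_a]          # the k_a nearest neighbours
        n1 = kernel(d_a[idx], b_a)           # Eq. (7)
        n2 = n1 / np.sum(n1)                 # Eq. (11), normalized weights
        return float(np.sum(n2 * r_u[idx]))  # Eqs. (12)-(13), primary response

    rng = np.random.default_rng(0)
    d_a = rng.uniform(0.0, 2.0, size=50)
    r_u = rng.normal(size=50)
    print(bl_primary_response(d_a, r_u))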

• CM during the first learning phase: The first learning phase of BELPM begins when the input of BELPM is chosen from the training set, i_{u,j} in I_u = {i_{u,1}, i_{u,2}, ..., i_{u,N_u}} (see Fig. 2(a)). After receiving this input, BL starts performing its function and provides r_{a,j} and r_{u,j}, which are sent to CM; CM has three square nodes (see Fig. 3(b)). The function of the first node is defined according to (14) to provide r_j, i.e., the output of BELPM (the emotional response). The functions of the second and third nodes are defined according to (15) and (16), which provide the reinforcement (punishment), p_{a,j}, and the expected reinforcement (expected punishment), p^e_{a,j}, respectively. A supervised learning algorithm determines appropriate values for the weights.

r_j = w_1 r_{a,j} + w_2 r_{o,j} + w_3    (14)

p_{a,j} = w_{a,1} r_{u,j} + w_{a,2} r_{a,j} + w_{a,3}    (15)

p^e_{a,j} = r_{u,j} - r_{a,j}    (16)

• CM during the second learning phase: The second learning phase of BELPM begins when the input is chosen from the test data set, i_{c,j} (see Fig. 2(b)). As BL has no information about the desired output (the target response), it does not send any information about the desired output to the CM. In this phase, the first node of the CM has the same input and performs the same function as it does in the first learning phase. However, the input and the connection of the second square node are different: in this phase the reinforcement, p_{a,j}, is calculated according to (17), which differs from (15) in terms of its input. The third node has no input and no function, and it can be removed.

p_{a,j} = w_{a,1} r_j + w_{a,2} r_{a,j} + w_{a,3}    (17)

4) The expected reinforcements, p^e_a, and s_{c,j} are sent to ORBI, which is connected to CX and AMYG. The function of this part is defined to emulate certain roles of the orbitofrontal cortex. These roles include forming a stimulus–reinforcement association, evaluating reinforcement, and providing an output. Before explaining how ORBI performs, it should be noted that ORBI starts performing its functions after receiving the vector of the expected reinforcements, which means that BL of AMYG must have fulfilled its functions. The ORBI is composed of MO and LO, corresponding to the medial and lateral parts of the orbitofrontal cortex, respectively, and their functions are described as follows:

a) MO, which is responsible for the provision of the secondary response of the BELPM, receives s_{c,j} and calculates d_{o,i} = || s_{u,i} - s_{c,j} ||_2 for each s_{u,i}. The MO consists of a four-layer adaptive network. The first layer has k_o nodes (square), and the input vector and output vector of the first layer's nodes are d_{o min} (the k_o minimum values of the distance vector d_o = {d_{o,1}, d_{o,2}, ..., d_{o,N_u}}) and n^1_o, respectively. The function of the m-th node is the kernel function given in (18).

n^1_{o,m} = K( d_{o\,min,m} )    (18)

The second layer consists of k_o nodes. Each node calculates an output as shown in (19) and sends it to the third layer.

n^2_{o,m} = \frac{n^1_{o,m}}{\sum_{m=1}^{k_o} n^1_{o,m}}    (19)

The third layer has an input vector p^e_{a min}, which is the vector of the k_o values of p^e_a corresponding to d_{o min}. The third layer's nodes multiply p^e_{a min} and n^2_o, as given in (20).

n^3_o = n^2_o\, p^e_{a\,min}    (20)

The fourth layer has a single node with a summation function that provides the output of ORBI (the secondary response). The result of the third and fourth layers can be obtained according to (21). The output r_{o,j} is fed to the LO and CM.

r_{o,j} = n^2_o \cdot p^e_{a\,min}    (21)

b) LO evaluates the output of MO, generates p_{o,j} as reinforcement (punishment), and sends it to MO. It has one node (square) with the summation function given in (22).

p_{o,j} = w_{o,1} r_{o,j} + w_{o,2}    (22)

The main structure of BELPM is similar to the amygdala-orbitofrontal model. However, the connections between the components, their functions, and the definitions of the reinforcement functions are significantly different from the amygdala-orbitofrontal model. Moreover, in this study, we have used pre-trained neural networks to explain the functionality of TH and CX; however, to increase the functionality and adaptability of BELPM, they could be defined by multi-layer NNs with trainable weights.
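Putting the pieces together, ORBI's secondary response (18)-(21) applies the same weighting scheme to the expected punishments p^e_a recorded during the first learning phase, and (14) combines the two responses. A hedged sketch of this reconstruction (k_o, b_o, the initial weights, and all names are placeholders of ours):

    import numpy as np

    def kernel(d, b=1.0):
        return np.exp(-d * b)                 # Eq. (8)

    def mo_secondary_response(d_o, p_e, k_o=10, b_o=1.0):
        # Eqs. (18)-(21): normalized kernel weights over the k_o nearest training
        # samples, applied to their expected punishments p_e (errors made by BL).
        idx = np.argsort(d_o)[:k_o]
        n1 = kernel(d_o[idx], b_o)            # Eq. (18)
        n2 = n1 / np.sum(n1)                  # Eq. (19)
        return float(np.sum(n2 * p_e[idx]))   # Eqs. (20)-(21)

    def belpm_output(r_a, r_o, w=(1.0, 1.0, 0.0)):
        # Eq. (14): the final emotional response combines both parts linearly.
        w1, w2, w3 = w
        return w1 * r_a + w2 * r_o + w3

    # Toy usage: a primary response r_a from BL is corrected by the ORBI term.
    rng = np.random.default_rng(1)
    d_o = rng.uniform(0.0, 2.0, size=50)
    p_e = rng.normal(scale=0.1, size=50)      # expected punishments from phase one
    print(belpm_output(r_a=0.8, r_o=mo_secondary_response(d_o, p_e)))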

Fig. 3. (a) The adaptive network of BL. (b) The adaptive networks of the CM nodes of AMYG and ORBI. (c) The adaptive network of ORBI.

B. Learning Algorithms of BELPM

In the following, we explain how the BELPM uses the combination of two learning methods, SD [3] and LSE, to learn the input–output mapping (the stimulus–response association). The learning parameters are the weights (w_1, w_2, w_3, etc.) and the parameters of the kernel functions (b_o and b_a). As mentioned earlier, the learning algorithm of BELPM is divided into two phases: the first learning phase and the second learning phase. Each of them uses different learning rules to adjust the learning parameters.

1) First learning phase: In the first learning phase, a hybrid learning algorithm [3] that is a combination of SD and LSE is used to update the learning parameters of AMYG (e.g., b_a, w_{a,1}, w_{a,2}, and w_1, w_2, w_3) and ORBI (b_o, w_{o,1}, w_{o,2}). Under the assumption that the linear parameters w_1, w_2, w_3, w_{o,1}, w_{o,2}, w_{a,1}, w_{a,2}, w_{a,3} have fixed values, the nonlinear parameters (kernel parameters), b_a and b_o, are updated by SD in a batch-learning manner. Thus, SD is applied to minimize two loss functions, which are defined on the basis of p_a and p_o. Equations (23) and (24) are the SD-based learning rules used to update the kernel parameters from the derivatives of the loss functions with respect to b_a and b_o.

b_a(it+1) = b_a(it) - \eta_a(it)\, \nabla b_a(it)    (23)

b_o(it+1) = b_o(it) - \eta_o(it)\, \nabla b_o(it)    (24)

The parameter it denotes the current iteration; b_a(it) and b_o(it) are the current values of the learning parameters, and \nabla b_a(it) and \nabla b_o(it) are the gradients of the loss functions with respect to the parameters b_a and b_o, as given in (25) and (26). The two learning rates, \eta_a(it) and \eta_o(it), are defined as functions of b_a and b_o.

\nabla b_a = \frac{\partial f(p_a)}{\partial b_a}    (25)

\nabla b_o = \frac{\partial f(p_o)}{\partial b_o}    (26)

where p_a and p_o are defined using (27) and (28).

p_a = w_{a,1}\, r_{ua} + w_{a,2}\, r_a + w_{a,3}    (27)

p_o = w_{o,1}\, r_o + w_{o,2}    (28)

The ORBI and AMYG have their own loss functions and update their own learning parameters, b_o and b_a, separately. The derivatives in (25) and (26) are formulated by using the chain rules of (29) and (30). Here, we have omitted the iteration index and only state the chain rules; in the algorithm, they are calculated for the current iteration it.

\frac{\partial p_a}{\partial b_a} = w_{a,2}\, \frac{\partial r_a}{\partial b_a} = w_{a,2}\, r_{ua}\, \frac{n^1_a\, d_{a\,min}}{\left( \sum_{m=1}^{k_a} n^1_{a,m} \right)^{2}}    (29)

\frac{\partial p_o}{\partial b_o} = w_{o,1}\, \frac{\partial r_o}{\partial b_o} = w_{o,1}\, p^e_{a\,min}\, \frac{n^1_o\, d_{o\,min}}{\left( \sum_{m=1}^{k_o} n^1_{o,m} \right)^{2}}    (30)

An offline version of LSE is used to update the linear parameters under the assumption that the nonlinear parameters have been updated and their values are fixed. The output of BELPM, r_j, is parameterized by the weights {w_1, w_2, w_3}. Furthermore, p_{a,j} and p_{o,j} are formulated using the linear parameters w_{a,1}, w_{a,2}, w_{a,3}, w_{o,1}, and w_{o,2}, according to (27) and (28). The LSE updates the weights w, w_a, and w_o by substituting each triple of the set {(r_{a,j}, r_{o,j}, r_{u,j}), j = 1, ..., N_u} into (14); each triple of {(r_{a,j}, r_{u,j}, p_{a,j}), j = 1, ..., N_u} and each pair of {(r_{o,j}, p_{o,j}), j = 1, ..., N_u} are also substituted into (15) and (22), respectively. Thus, N_u linear equations such as (31), (32), and (33) are derived.

r_{u,j} = w_1 r_{a,j} + w_2 r_{o,j} + w_3    (31)

p_{a,j} = w_{a,1} r_{u,j} + w_{a,2} r_{a,j} + w_{a,3}    (32)

p_{o,j} = w_{o,1} r_{o,j} + w_{o,2}    (33)

Equations (31), (32), and (33) can be rewritten in matrix form with A = [r_{a,j}  r_{o,j}  1]_{j=1}^{N_u}, B = [r_{u,j}  r_{a,j}  1]_{j=1}^{N_u}, and C = [r_{o,j}  1]_{j=1}^{N_u} to define (34), (35), and (36), and the linear parameters are updated using LSE. Here, the weight vectors are defined as w = [w_1, w_2, w_3], w_a = [w_{a,1}, w_{a,2}, w_{a,3}], and w_o = [w_{o,1}, w_{o,2}].

w = (A^T A)^{-1} A^T r_u    (34)

w_a = (B^T B)^{-1} B^T p_a    (35)

w_o = (C^T C)^{-1} C^T p_o    (36)

During the first learning phase, the learning parameters, linear and nonlinear, can be updated by using one of the following methods:
• All parameters are updated using SD.
• The nonlinear parameters are updated using SD and the initial values of the linear parameters are adjusted using LSE.
• The linear parameters are updated using LSE and the initial values of the kernel function parameters are chosen using a heuristic method.
• The nonlinear parameters are updated using SD and LSE is applied to update the linear parameters.

Certainly, these methods differ in terms of time complexity and prediction accuracy, and a tradeoff between high accuracy and low computational time must be considered to choose a feasible method. The batch mode or the online mode of each method can be considered for the first learning phase.

2) Second learning phase: In the second learning phase, the nonlinear (kernel) parameters are updated by a reinforcement-learning scheme; the nonlinear parameters are updated by SD. The SD algorithm minimizes the loss functions, which are defined based on the reinforcements p_{a,j} and p_{o,j}, using (37) and (28). It must be noted that p_a is calculated here from the obtained output of BELPM, r_j, as given in (37), which differs from the value calculated using (27). The adjusting rules that update b_o and b_a are calculated according to (23) and (24), and the derivatives of the loss functions with respect to b_a and b_o are calculated according to (38) and (30). Here, the weights w_{a,4}, w_{a,5}, w_{a,6} are equal to [1, -1, 0].

p_a = w_{a,4}\, r + w_{a,5}\, r_a + w_{a,6}    (37)

\frac{\partial p_a}{\partial b_a} = w_{a,5}\, \frac{\partial r_a}{\partial b_a} = w_{a,5}\, r_{ua}\, \frac{n^1_a\, d_{a\,min}}{\left( \sum_{m=1}^{k_a} n^1_{a,m} \right)^{2}}    (38)
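A sketch of the hybrid update of the first learning phase, assuming the reconstructions of (23)-(24) and (34)-(36) above (the gradient is approximated numerically here instead of through the chain rules (29)-(30), and the toy data stand in for the responses r_a, r_o collected over the training set):

    import numpy as np

    def lse(A, targets):
        # Eqs. (34)-(36): ordinary least squares, w = (A^T A)^(-1) A^T y.
        w, *_ = np.linalg.lstsq(A, targets, rcond=None)
        return w

    def sd_step(b, loss_fn, eta=0.01, eps=1e-6):
        # Eqs. (23)-(24): one steepest-descent step on a kernel parameter b,
        # with the gradient approximated by a central finite difference.
        grad = (loss_fn(b + eps) - loss_fn(b - eps)) / (2 * eps)
        return b - eta * grad

    # Example of the LSE part for the output weights w1, w2, w3 of Eq. (31):
    # each row of A is [r_a_j, r_o_j, 1] and the target is r_u_j.
    rng = np.random.default_rng(1)
    r_a, r_o = rng.normal(size=100), rng.normal(size=100)
    r_u = 0.7 * r_a + 0.3 * r_o + 0.1 + 0.01 * rng.normal(size=100)
    A = np.column_stack([r_a, r_o, np.ones_like(r_a)])
    print(lse(A, r_u))   # should be close to [0.7, 0.3, 0.1]

In practice the SD step and the LSE solve would alternate, per one of the four update schedules listed above.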
IV. CASE STUDIES: CHAOTIC TIME SERIES

In this section, BELPM is evaluated as a prediction model by using two benchmark chaotic time series, Lorenz and Henon. To provide a careful comparison with other methods, we used various data sets with different initial points and sizes of training samples. We also utilized two error measures, the normalized mean square error (NMSE) and the mean square error (MSE), as given in (39) and (40), to assess the performance of the prediction models and provide results comparable with other studies.

NMSE = \frac{\sum_{j=1}^{N} (\hat{y}_j - y_j)^2}{\sum_{j=1}^{N} (y_j - \bar{y})^2}    (39)

MSE = \frac{1}{N} \sum_{j=1}^{N} (\hat{y}_j - y_j)^2    (40)

where \hat{y} and y refer to the observed (predicted) values and the desired targets, respectively, and \bar{y} is the average of the desired targets. For all experiments, one-fold cross-validation was chosen; the number of samples in one-fold cross-validation was equal to the size of the test data set.

A. Lorenz Time Series

The Lorenz time series [5], [9], [58] was chosen as the first test; it is given by (41) and (42). In this case study, the initial point is x(0) = 15, y(0) = 0, z(0) = 0.

\dot{x} = a (y - x), \qquad \dot{y} = b x - y - x z, \qquad \dot{z} = x y - c z    (41)

a = 10, \quad b = 28, \quad c = 8/3, \quad T = 0.01\ \mathrm{s}    (42)

To produce the time series, the sampling period is equal to 0.01 s [5], [9], [58], and the embedding dimension is selected as three. The BELPM is tested on four data sets from the Lorenz time series. The first data set is selected from the 32nd to the 51st s, and is employed for long-term prediction. For the training set, 500 samples are used, and the next 1400 samples are considered as the test data set and the validation data set. Table I presents the NMSEs obtained from applying different methods for 10, 30, and 40 steps ahead prediction of this data set. It indicates that the NMSE of ANFIS is lower than the NMSE of BELPM for 10 steps ahead prediction. However, when the prediction horizon increases to 30 and 40 steps ahead, the NMSEs of BELPM are lower than those of ANFIS. The results presented in Table I also show that an increase in the prediction horizon causes a decrease in the prediction accuracy for all methods. It is important to note that the NMSEs of BELPM for predicting 30 and 40 steps ahead are lower than those of the other methods.

TABLE I
THE COMPARISON OF NMSES OF DIFFERENT METHODS FOR MULTI-STEP AHEAD PREDICTION OF THE LORENZ TIME SERIES (THE FIRST DATA SET)
Learning Method    Ten           Thirty    Forty
BELPM              0.0125        0.2473    0.2447
ANFIS              0.006         0.3559    0.3593
RBF                0.4867 [58]   0.3405    0.6887
LLNF               0.1682 [58]   0.4946    0.5341
Wk-NN              0.0235        0.2599    0.3830

For the second data set, ANFIS, Wk-NN, and BELPM are tested for 1-20 steps ahead prediction, using 1500 samples as the training data set and 1000 samples as the test data set. In Fig. 4, the NMSE values vs. the prediction steps are depicted; it can be observed that the NMSE values of BELPM are lower than those of the other methods, especially in the case of 15 steps ahead prediction and longer.
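For reference, data of the kind used in this section can be generated as follows (our own sketch: simple Euler integration of (41) with the stated step T = 0.01 s, a delay embedding of dimension three, and the NMSE of (39); the original experiments do not specify the integrator):

    import numpy as np

    def lorenz_series(n, dt=0.01, a=10.0, b=28.0, c=8.0/3.0, x0=(15.0, 0.0, 0.0)):
        # Euler integration of Eq. (41); the x-component is used as the series.
        x, y, z = x0
        xs = np.empty(n)
        for i in range(n):
            dx, dy, dz = a * (y - x), b * x - y - x * z, x * y - c * z
            x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
            xs[i] = x
        return xs

    def embed(series, dim=3, horizon=1):
        # Input vectors of `dim` past values; target is `horizon` steps ahead.
        X = np.array([series[i:i + dim] for i in range(len(series) - dim - horizon + 1)])
        y = series[dim + horizon - 1:]
        return X, y

    def nmse(y_hat, y):
        # Eq. (39): error normalized by the variance of the desired targets.
        return np.sum((y_hat - y) ** 2) / np.sum((y - np.mean(y)) ** 2)

    series = lorenz_series(2000)
    X, y = embed(series, dim=3, horizon=10)
    print(X.shape, y.shape, nmse(0.99 * y, y))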

Fig. 4. The NMSE values of multi-step ahead prediction of the Lorenz system vs. the prediction horizon for 1000 samples of test data, using 1500 samples as training data (the second data set). The dotted curve shows the NMSE of BELPM.

In the third data set, ANFIS, Wk-NN, and BELPM are compared for 30 steps ahead prediction with different sizes of the training data set. In this case, the complete sample set is chosen from the 30th to the 55th s, and the training is conducted using 500, 1000, and 1500 samples to predict 1000 samples of the test data set from the 45th to the 55th s. The validation set is 1000 samples from the 56th to the 65th s. Table II compares the values of NMSE for the different methods and once again shows that the NMSE of BELPM has the lowest value among all the methods tested.

TABLE II
THE COMPARISON OF NMSES OF DIFFERENT METHODS FOR THIRTY STEPS AHEAD PREDICTION OF THE LORENZ TIME SERIES USING DIFFERENT SIZES OF TRAINING SAMPLES
Learning Method    500/1000    1000/1000    1500/1000
BELPM              0.1711      0.0108       0.0280
ANFIS              0.4199      0.1002       0.0947
Wk-NN              0.1951      0.1518       0.0425

For the fourth data set, the training set and test data are picked from the 15th to the 45th s and the 46th to the 55th s, respectively. BELPM is applied for 25 steps ahead prediction, and the NMSE values, CPU time, and structures of the different models are compared in Table III. Again, it can be observed that the NMSE of BELPM is lower than that of the other methods. The time complexity of BELPM is less than that of most of the other methods; only Wk-NN is found to have a slightly lower time complexity than BELPM.

TABLE III
THE COMPARISON OF NMSES, CPU TIME AND STRUCTURE OF DIFFERENT METHODS TO PREDICT 25 STEPS AHEAD USING 3000 TRAINING DATA SAMPLES
Learning Method    NMSE      TIME (s)    STRUCTURE
BELPM              0.0325    11.81       10 neurons
ANFIS              0.0802    12.91       4 rules
LoLiMoT            0.2059    18.74       7 neurons
RBF                0.1193    26.07       32 neurons
Wk-NN              0.0342    1.23        5 neighbors

As mentioned earlier, the Lorenz chaotic time series is a well-known benchmark time series and has been tested with numerous data-driven models to evaluate the models' performance. Table IV presents the obtained NMSEs of several data-driven methods for noiseless and noisy data. The data-driven models are: the Nonlinear Autoregressive model with eXogenous input (Hybrid NARX-Elman RNN) [59], Evolving Recurrent Neural Networks (ERNN) [60], the Radial Basis Function (RBF) network and multilayer perceptron (MLP) [5], Support Vector Regression (SVR), the Tapped Delay Line Multilayer Perceptron (TDL-MLP), Distributed Local Experts based on Vector Quantization using Information Theoretic learning (DLE-VQIT) [61], Cooperative Coevolution of Elman Recurrent Neural Networks (CCRNN) [62], the Functional Weights Wavelet Neural Network-based state-dependent AutoRegressive model (FWWNN-AR) [63], the Recurrent Neural Network trained with Real-time Recurrent Learning (RNN-RTRL), the Recurrent Neural Network trained with the second-order Extended Kalman Filter (RNN-EKF), the Recurrent Neural Network trained with BackPropagation Through Time (RNN-BPTT), the feedforward MultiLayer Perceptron trained with the Bayesian Levenberg-Marquardt method (MLP-BLM), and recursive second-order training of Recurrent Neural Networks via a Recursive Bayesian Levenberg-Marquardt (RBLM-RNN) algorithm [65]. It can be noted that the table is sorted according to the obtained NMSEs, and that the NMSE of BELPM is not as low as that of the best methods. However, this table compares the NMSEs for short-term prediction, which is not the intended application of BELPM.

TABLE IV
THE COMPARISON OF NMSES OF DIFFERENT METHODS FOR THE LORENZ TIME SERIES
Learning Method    NMSE      Training/Test samples    Fundamental method and steps ahead
FWWNN [63]         9.8e-15   1500, 1000               NN+AR, 1 step, noiseless
NARX [59]          1.9e-10   1500, 1000               AR, 1 step, noiseless
BELRFS [12]        4.9e-10   1500, 1000               NF+BEL, 1 step, noiseless
ERNN [60]          9.9e-10   1500, 1000               NN, 1 step, noiseless
RBF [5]            1.4e-9    1500, 1000               NN, 1 step, noiseless
MLP [5]            5.2e-8    1500, 1000               NN, 1 step, noiseless
BELPM              2.9e-6    1500, 1000               Wk-NN+BEL, 1 step, noiseless
TDL-MLP [61]       1.6e-4    -----                    NN, 1 step, noiseless
DLE-VQIT [61]      2.6e-4    -----                    ---, 1 step, noiseless
LSSVMs [64]        6.4e-5    1000, 250                NF, 1 step, noisy (STD 0.05)
BELPM              3.7e-4    1500, 1000               Wk-NN+BEL, 1 step, noisy (STD 0.001)
BELPM              4.4e-4    1500, 1000               Wk-NN+BEL, 1 step, noisy (STD 0.01)
RBLM-RNN [65]      9.0e-4    1000, 250                RNN, 1 step, noisy (STD 0.05)
CCRNN [62]         7.7e-4    500, 500                 NN, 2 step, noiseless
LLNF [64]          2.9e-4    1000, 250                RNN, 1 step, noisy (STD 0.05)
MLP-BLM [65]       8.1e-4    1000, 250                NN, 1 step, noisy (STD 0.05)
MLP-EKF [65]       1.6e-3    1000, 250                NN, 1 step, noisy (STD 0.05)
RNN-RTRL [65]      1.7e-3    1000, 250                NN, 1 step, noisy (STD 0.05)
RNN-BPTT [65]      1.8e-3    1000, 250                NN, 1 step, noisy (STD 0.05)
RNN-EKF [65]       1.2e-3    1000, 250                RNN, 1 step, noisy (STD 0.05)
BELPM              1.0e-3    1500, 1000               Wk-NN+BEL, 1 step, noisy (STD 0.1)

B. Henon Time Series

The second benchmark of this study is the Henon time series, which is constructed using (43):

x(t+1) = 1 - a\, x(t)^2 + y(t), \qquad y(t+1) = b\, x(t), \qquad a = 1.4,\ b = 0.3    (43)

The sampling period and the embedding dimension are considered as 0.01 s [5], [58] and three, respectively. In this case study, the initial point is x(0) = 0, y(0) = 0. Three data sets of the Henon time series are used for evaluating the BELPM. The first data set is selected from the 9th to the 18th s, and the training data set and the test data set consist of 800 and 100 samples, respectively. The BELPM, ANFIS, and Wk-NN are tested to predict three steps ahead of this data set.
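The Henon series of (43) can be generated directly from the map; a minimal sketch (the windowing into the specific training and test intervals used in these experiments is omitted):

    import numpy as np

    def henon_series(n, a=1.4, b=0.3, x0=0.0, y0=0.0):
        # Eq. (43): x(t+1) = 1 - a*x(t)^2 + y(t), y(t+1) = b*x(t).
        x, y = x0, y0
        xs = np.empty(n)
        for i in range(n):
            x, y = 1.0 - a * x * x + y, b * x
            xs[i] = x
        return xs

    print(henon_series(10))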

Fig. 5. The obtained errors of three steps ahead prediction of the Henon time series by BELPM and ANFIS.

TABLE V
THE COMPARISON OF NMSES, CPU TIME AND STRUCTURE OF DIFFERENT METHODS TO PREDICT THREE STEPS AHEAD OF THE HENON TIME SERIES (FIRST DATA SET OF HENON)
Learning Method    NMSE      Structure      CPU Time (s)
BELPM              0.0065    27 neurons     0.3315
RBF [5]            0.0872    42 neurons     ------
LoLiMoT [5]        0.0291    20 neurons     ------
ANFIS              0.0232    25 rules       153.3
Wk-NN              0.0107    5 neighbors    0.0725

Table V presents the NMSEs, structure, and computational time (CPU time) for the tested methods. It can be noted that the NMSE of BELPM is lower than that of the other methods. The table also indicates that the number of neurons in BELPM is larger than the number of rules in ANFIS; however, the CPU time of BELPM is less than that of ANFIS. The graph in Fig. 5 displays the obtained errors (the differences between the target values and the obtained outputs) from ANFIS and BELPM for the test data set. We have also compared the effect of different structures of Wk-NN, ANFIS, and BELPM on their prediction accuracy. The structures of these methods have been changed by increasing the number of neighbors in BELPM and Wk-NN and the number of rules in ANFIS. As Fig. 6 shows, the value of NMSE for ANFIS decreases when the number of rules increases. In contrast, increasing the number of neighbors of Wk-NN increases the value of NMSE. For BELPM, the value of NMSE decreases slowly when the number of neighbors increases. This experiment verifies the generalization capability of BELPM and shows that different structures of BELPM do not make a noticeable difference in the obtained prediction errors.

Fig. 6. The comparison of NMSEs of three different structures obtained using BELPM, ANFIS, and Wk-NN; the number of neurons or rules is varied over 4, 9, 16, and 25.

To further evaluate the performance of the BELPM and verify its robustness, white noise with standard deviation 0.1 is added to the first data set, and the results of multi-step ahead prediction are listed in Table VI. The effect of using the second learning phase of BELPM is also presented in Table VI. It is clear that there is a reduction in the obtained NMSEs of BELPM because of the second learning phase (SLP) compared with the first learning phase (FLP) alone.

TABLE VI
THE COMPARISON OF NMSES OF DIFFERENT METHODS TO PREDICT MULTIPLE STEPS AHEAD OF THE HENON TIME SERIES
Learning Method    1 Step-ahead    2 Step-ahead    3 Step-ahead
BELPM (SLP)        0.0063          0.0251          0.1436
BELPM (FLP)        0.0066          0.0274          0.1576
ANFIS              0.0051          0.0244          0.2539
LoLiMoT            0.0590 [5]      0.2475          0.7048
Wk-NN              0.0078          0.0479          0.1897

The graph in Fig. 7 shows the values of NMSE of two steps ahead prediction of the first data set during the learning phases. It can be observed that the NMSE decreases continuously during the first and second learning phases. The graph in Fig. 8 depicts the mean square error (MSE) obtained from the training samples and test samples during the first learning phase. In each epoch, the learning parameters are updated and the values of MSE for the training data and the test data are calculated. It is noticeable that the values of the dotted curve, which is related to the MSE of the test data, are lower than those of the solid curve, i.e., the MSE of the training data.

Fig. 7. The NMSE of BELPM for two steps ahead prediction of the Henon time series with added white noise of standard deviation 0.1. The solid line shows the NMSE during 35 epochs of the first learning phase and the dashed line shows the NMSE during 10 epochs of the second learning phase.

Fig. 8. Curves of the mean square error during the first learning phase; the solid line is the MSE curve for the training data and the dashed line is the MSE curve for the test data.
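The robustness experiment can be reproduced in outline by corrupting the series before embedding and measuring the NMSE of multi-step prediction; a self-contained sketch (plain Wk-NN stands in for BELPM here, and the sample counts are illustrative rather than the exact splits used above):

    import numpy as np

    def henon(n, a=1.4, b=0.3):
        x, y, out = 0.0, 0.0, np.empty(n)
        for i in range(n):
            x, y = 1.0 - a * x * x + y, b * x
            out[i] = x
        return out

    def wknn(xq, X, r, k=5, sigma=1.0):
        d = np.linalg.norm(X - xq, axis=1)
        idx = np.argsort(d)[:k]
        w = np.exp(-0.5 * (d[idx] / sigma) ** 2)
        return np.sum(w * r[idx]) / np.sum(w)

    rng = np.random.default_rng(0)
    series = henon(900) + rng.normal(scale=0.1, size=900)   # white noise, STD 0.1
    dim, horizon = 3, 2                                      # two steps ahead
    X = np.array([series[i:i + dim] for i in range(len(series) - dim - horizon + 1)])
    y = series[dim + horizon - 1:]
    X_tr, y_tr, X_te, y_te = X[:700], y[:700], X[700:], y[700:]
    y_hat = np.array([wknn(x, X_tr, y_tr) for x in X_te])
    print("NMSE:", np.sum((y_hat - y_te) ** 2) / np.sum((y_te - y_te.mean()) ** 2))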

The second data set of the Henon time series is selected from the 8th to the 18th s, and three different sizes of data samples are selected as training data. Table VII compares the obtained results from examining different methods that are trained with different training sets (300, 600, and 900 training samples) and used to predict 100 samples of test data. As shown in Table VII, the prediction accuracy of BELPM is higher than that of the other methods in all cases; thus, it is a suitable method for prediction applications with a small number of training samples. The bar chart in Fig. 9 shows the NMSE for the compared methods. It can be seen that the values of NMSE of BELPM are once again the lowest among all the presented methods. Examination of this data set revealed that a decrease in the number of training samples causes an increase in the values of NMSE of all methods (e.g., BELPM, ANFIS, Wk-NN, and RBF). However, the rate of increase of the NMSE of BELPM is lower than that of the other methods.

TABLE VII
THE COMPARISON OF NMSES OF DIFFERENT METHODS TO PREDICT 100 DATA SAMPLES USING DIFFERENT NUMBERS OF TRAINING SAMPLES
Method    900/100    600/100    300/100
BELPM     0.0067     0.0102     0.0269
ANFIS     0.0301     0.0406     0.0538
Wk-NN     0.0107     0.0118     0.0468
RBF       0.0155     0.0307     0.0457

Fig. 9. The NMSE of different methods for three steps ahead prediction of the Henon time series using different sizes of training data (100, 300, 600, and 900) for 100 samples of test data.

The third data set of the Henon time series is selected to compare the time complexities (CPU time) of the various learning methods when the size of the training set is large. A training set of 3000 samples is selected from the 100th to the 130th s, and the following 1000 samples, from the 130th to the 140th s, are considered as the test data set. Table VIII summarizes the NMSEs, CPU time, and the applied structures of the different methods. Although the CPU time of Wk-NN (1.128 s) is less than that of BELPM, the prediction accuracy of BELPM is higher than that of Wk-NN and the other methods.

TABLE VIII
THE COMPARISON OF DIFFERENT METHODS TO PREDICT THREE STEPS AHEAD OF THE HENON TIME SERIES (THIRD DATA SET)
Learning Method    NMSE      TIME (s)    STRUCTURE
BELPM              0.0077    5.72        5 neurons
ANFIS              0.0514    627.56      16 rules
LLNF               0.2059    18.74       7 neurons
RBF                0.1193    26.07       32 neurons
Wk-NN              0.0083    1.128       3 neighbors

Figure 10 displays the obtained errors from examining BELPM and ANFIS for three steps ahead prediction. The figure shows that BELPM is more accurate than ANFIS. Figure 11 depicts how the NMSE index of BELPM decreases over its learning phases. It becomes obvious that for both the training and test data, there is a continuous decrease in the NMSE index over the epochs. For this experiment, we have applied BELPM to the third data set and have chosen 200 epochs for both the first and second learning phases. This figure verifies the ability of BELPM to adjust the learning parameters in the online mode.

Fig. 10. The prediction errors of ANFIS and BELPM for three steps ahead prediction of the Henon time series using 3000 training samples.

Error of training data Error of Test data long-term state of chaotic time series rapidly and more -3 -3 x 10 (First learning phase) x 10 (First learning phase) 1.984 7.845 accurately than other well-known methods, i.e., RBF,

7.84 LoLiMoT, and ANFIS. The results also show that BELPM is 1.983 more efficient than the other methods when large training data 7.835 sets are not available. 1.982 7.83 In comparison with other data-driven approaches, we can 7.825 summarize the highlighted features of BELPM as follows: 1.981 7.82 1) It converges very fast into its optimal structure (see Tables

1.98 7.815 III, IV, and VII). NMSE error indexes NMSE error NMSE error indexes NMSE error 2) It has high generalization capability (see dotted lines in 7.81 1.979 Figs. 7 and 8), high robustness from the perspective of 7.805 learning algorithms (see solid lines in Figs. 7, 8, 11, and 12), 1.978 7.8 low noise sensitivity (see Table V), relatively low 0 50 100 150 200 0 50 100 150 200 epochs epochs computational time (compare the values of time columns in (b) Fig.11. The NMSE of three steps ahead predictions during first learning Tables III, IV, and VII), and low model complexity. phase. (a) For training data. (b) For test data. 3)The number of neighbors in AMYG and ORBI is not dependent on the input’s dimension, which indicates that the The Henon chaotic time series is also a well-known dimension of data does not have a direct effect on the model benchmark time series, and has been examined by a large complexity. However, the input’s dimension increases the number of data-driven models. Table IX lists the obtained computational time of calculating the Euclidean distance and NMSEs of several data-driven methods. It becomes obvious searching the nearest neighbors; thus, it has an indirect effect that BELPM has a fairly good performance in predicting noisy on the computational time of BELPM. The experiments have data time series. also shown that when there are a limited number of samples for training the model, the number of neighbours, k and TABLE IX o THE NMSE VALUES OF DIFFERENT METHODS TO PREDICT HENON TIME SERIES ka should be raised to get accurate results (compare Table IV Learning NMSE No of Fundamental Method and Method Training time series and VII). A general observation is that a better prediction and Test result could be achieved when the number of neighbours, k , data o samples in the ORBI part is approximately twice the number of RBF[5] 1.4e-9 1500,1000 NN, 1 step noiseless neighbours, , in the AMYG part. The feasible values and MLP[5] 5.2e-8 1500,1000 NN, 1 step noiseless TDL-MLP[61] 1.6e-4 ----- NN, 1 step noiseless effective combination of k and k can be determined using DLE-VQIT[61] 2.6e-4 ------, 1 step noiseless a o BELPM 3.3e-4 800,100 W-kNN+BEL, 1 step noisy- meta-heuristic optimization methods (e.g., genetic algorithm). STD 0.001 4) The second learning phase of BELPM provides online LSSVMs[64] 4.4e-4 1000,250 NF, 1 step noisy-STD 0.05 adaptation and continuously increases the prediction accuracy RBLM- 6.8e-4 1000,250 RNN,1 step noisy-STD 0.05 RNN[65] and offers the capability to overcome both over-fitting and RNN-EKF[65] 8.6e-4 1000,250 RNN,1 step noisy-STD 0.05 under-fitting problems. LLNF 7.7e-4 1000,250 RNN, 1 step noisy-STD 0.05 5) For long-term prediction using a small number of training MLP-BLM[65] 8.1e-4 1000,250 NN, 1 step noisy-STD 0.05 samples, the accuracy of BELPM is higher than other data- BELPM 6.4e-3 800,100 W-kNN+BEL ,1 step noisy- STD 0.01 driven models, such as ANFIS and LoLiMoT (see Tables I, II, RNN-BPTT[65] 1.1e-3 1000,250 NN, 1 step noisy-STD 0.05 IV, and V). In the case of using a large number of training RNN- 1.0e-3 1000,250 NN, 1 step noisy-STD 0.05 samples, it is also noticeable that the computational time of RTRL[65] BELPM is not greater than that of ANFIS and LoLiMoT (see

Tables III and VII). It can also be concluded that when the degree of chaos is high, even using a large number of data and

neuro-fuzzy methods (ANFIS and LoLiMoT) would not achieve higher prediction accuracy than the BELPM (see V. DISCUSSION AND CONCLUSION Table VII). This study has presented a prediction model inspired by As future works, the authors consider adding some brain emotional processing and, particularly, has investigated optimization methods (e.g., genetic algorithm) to find optimal this model for chaotic time series prediction. We have values of the fiddle parameters, e.g., the number of neighbors described the architecture of this model using feedforward and ko and the initial values of nonlinear parameters. Other neural networks and adaptive networks. Furthermore, we have improvements in the model would be made on the basis of kd- also explained the function and learning algorithms of the Tree data structure [47] to address “the curse of model that is referred to as BELPM. The accuracy of the dimensionality” [1] problem and decrease the computational BELPM has been extensively evaluated by different data sets time complexity of BELPM. To adjust the nonlinear parameters, different types of optimization methods (e.g., [21] M.A.Arbib, The Handbook of Brain Theory and Neural Networks. 2nd Quasi-Newton or Conjugate Directions) for ORBI and AMYG edition, Cambridge, MA: The MIT Press, 2002. [22] J.Moren, C.Balkenius,“A computational model of emotional learning in can be utilized. In addition, the Temporal Difference (TD) the amygdala,’’, in From Animals to Animats, MIT, Cambridge, 2000. learning algorithm can also be used as a reinforcement method [23] M. S. Gazzaniga, R. B, Ivry, G.R.Mangun, and Megan.S.Steven, for the second learning phase to update the linear learning Gognative Nerosc in The Biology of the Mind. W.W.Norton&Company, rd parameters. The good results obtained by employing the New York, 3 ed., 2009. [24] D.Reisbetg, R.College., Cognetion: Exploring the science of the Mind, BELPM for predicting the chaotic time series are a motivation Newyork, 4th edition, W. W. Norton & Company, New York 2006. for applying this model as a classification method as well as to [25] J.M.Jenkins, K.Oatley, N.L.Stein, Human : A READER, identify complex systems. Blockwell publisher, U.K., 1998. [26] D. H. Hubel, M. S. Livingstone, “Color and Contrast Sensitivity in the Lateral Geniculate Body and Primary Visual Cortex of the Macaque References Monkey,” J., Neuroscience. vol. 10, no.7, pp. 2223-2237, 1990. [1] O. Nelles, Identification:From classicical Approches [27] J. P. Kelly, “The Neural Basis of Perception and Movement, Principles to Neural Networks and Fuzzy Models. Berlin, Germany: Springer- of Neural Science,” London: Prentice Hall. 1991. Verlag, 2001. [28] K. Amunts., O. Kedo., M. Kindler., P. Pieperhoff., H. Mohlberg., N. [2] S. Haykin, Neural Networks: A Comperhensive Foundation.Upper Shah., U. Habel., F. Schneider., K. Zilles., “Cytoarchitectonic mapping Saddle River, NJ:Prentice Hall, 2nd ed., 1999. of the human amygdala, hippocampal region and entorhinal cortex : [3] R. Jang, C. Sun and E. Mizutani, Neuro-Fuzzy and Soft Computing: A intersubject variability and probability maps,’’ J. Anatomy and computational approach to Learning and Machine Intelligence.Upper Embryology., vol. 21, no. 5-6, pp. 343-352, 2005. Saddle River, NJ: Prentice Hall, 1997. [29] C. I. Hooker., L. T. Germine., R. T. Knight., M. D. Esposito., [4] R. J. Frank, N. Davey and S. P. Hunt, “Time Series Prediction and “Amygdala Response to Facial Expressions Reflects Emotional Neural Networks,” J. Intell Robot Syst., vol. 31, no. 
References

[1] O. Nelles, Nonlinear System Identification: From Classical Approaches to Neural Networks and Fuzzy Models. Berlin, Germany: Springer-Verlag, 2001.
[2] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Upper Saddle River, NJ: Prentice Hall, 1999.
[3] R. Jang, C. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence. Upper Saddle River, NJ: Prentice Hall, 1997.
[4] R. J. Frank, N. Davey, and S. P. Hunt, "Time Series Prediction and Neural Networks," J. Intell. Robot. Syst., vol. 31, no. 1-3, pp. 91-103, 2001.
[5] A. Golipour, B. N. Araabi, and C. Lucas, "Predicting Chaotic Time Series Using Neural and Neurofuzzy Models: A Comparative Study," Neural Process. Lett., vol. 24, no. 3, pp. 217-239, 2006.
[6] M. Parsapoor and U. Bilstrup, "Neuro-fuzzy models, BELRFS and LoLiMoT, for prediction of chaotic time series," in Proc. IEEE Int. Conf. INISTA, pp. 1-5, 2012.
[7] C. Lucas, D. Shahmirzadi, and N. Sheikholeslami, "Introducing BELBIC: brain emotional learning based intelligent controller," Intell. Autom. Soft Comput., vol. 10, no. 1, pp. 11-22, 2004.
[8] C. Lucas, A. Abbaspour, A. Gholipour, B. Nadjar Araabi, and M. Fatourechi, "Enhancing the performance of neurofuzzy predictors by emotional learning algorithm," Informatica (Slovenia), vol. 27, no. 2, pp. 165-174, 2003.
[9] T. Babaie, R. Karimizandi, and C. Lucas, "Learning based brain emotional intelligence as a new aspect for development of an alarm system," Soft Computing, vol. 9, no. 9, pp. 857-873, 2008.
[10] M. Parsapoor, C. Lucas, and S. Setayeshi, "Reinforcement recurrent fuzzy rule based system based on brain emotional learning structure to predict the complexity dynamic system," in Proc. IEEE Int. Conf. ICDIM, pp. 25-32, 2008.
[11] M. Parsapoor and U. Bilstrup, "Brain Emotional Learning Based Fuzzy Inference System (BELFIS) for Solar Activity Forecasting," in Proc. IEEE Int. Conf. ICTAI, 2012.
[12] M. Parsapoor and U. Bilstrup, "Chaotic Time Series Prediction Using Brain Emotional Learning Based Recurrent Fuzzy System (BELRFS)," to be published in Int. J. Reasoning-based Intelligent Systems, 2013.
[13] M. Parsapoor, "Prediction the price of Virtual Supply Chain Management with using emotional methods," M.S. thesis, Dept. Computer Eng., Science and Research Branch, IAU, Tehran, Iran, 2008.
[14] M. Abdollahzade, A. Miranian, and S. Faraji, "Application of emotional learning fuzzy inference systems and locally linear neuro-fuzzy models for prediction and simulation in dynamic systems," in Proc. IEEE Int. Joint Conf. Fuzzy Syst., pp. 1-8, 2012.
[15] E. Lotfi and M. R. Akbarzadeh-T, "Supervised brain emotional learning," in Proc. IJCNN, pp. 1-6, 2012.
[16] H. A. Zarchi, E. Daryabeigi, G. R. A. Markadeh, and J. Soltani, "Emotional controller (BELBIC) based DTC for encoderless Synchronous Reluctance Motor drives," in Proc. Int. Conf. PEDSTC, pp. 478-483, 2011.
[17] J. M. Fellous, J. L. Armony, and J. E. LeDoux, "Emotional Circuits and Computational Neuroscience," in The Handbook of Brain Theory and Neural Networks, Cambridge, MA: The MIT Press, 2003.
[18] J. L. Armony and J. E. LeDoux, "How the Brain Processes Emotional Information," Ann. N. Y. Acad. Sci., no. 821, pp. 259-270, 1997.
[19] S. Grossberg, "The Imbalanced Brain: From Normal Behavior to Schizophrenia," Biol. Psychiatry, vol. 48, no. 2, pp. 81-98, 2000.
[20] T. Dalgleish, "The emotional brain," Nat. Rev. Neurosci., vol. 5, no. 7, pp. 583-589, 2004.
[21] M. A. Arbib, The Handbook of Brain Theory and Neural Networks, 2nd ed. Cambridge, MA: The MIT Press, 2002.
[22] J. Morén and C. Balkenius, "A computational model of emotional learning in the amygdala," in From Animals to Animats, Cambridge, MA: MIT Press, 2000.
[23] M. S. Gazzaniga, R. B. Ivry, G. R. Mangun, and M. S. Steven, Cognitive Neuroscience: The Biology of the Mind, 3rd ed. New York: W. W. Norton & Company, 2009.
[24] D. Reisberg, Cognition: Exploring the Science of the Mind, 4th ed. New York: W. W. Norton & Company, 2006.
[25] J. M. Jenkins, K. Oatley, and N. L. Stein, Human Emotions: A Reader. Blackwell Publishers, U.K., 1998.
[26] D. H. Hubel and M. S. Livingstone, "Color and Contrast Sensitivity in the Lateral Geniculate Body and Primary Visual Cortex of the Macaque Monkey," J. Neurosci., vol. 10, no. 7, pp. 2223-2237, 1990.
[27] J. P. Kelly, "The Neural Basis of Perception and Movement," in Principles of Neural Science, London: Prentice Hall, 1991.
[28] K. Amunts, O. Kedo, M. Kindler, P. Pieperhoff, H. Mohlberg, N. Shah, U. Habel, F. Schneider, and K. Zilles, "Cytoarchitectonic mapping of the human amygdala, hippocampal region and entorhinal cortex: intersubject variability and probability maps," Anat. Embryol., vol. 21, no. 5-6, pp. 343-352, 2005.
[29] C. I. Hooker, L. T. Germine, R. T. Knight, and M. D'Esposito, "Amygdala Response to Facial Expressions Reflects Emotional Learning," J. Neurosci., vol. 26, no. 35, pp. 8915-8930, Aug. 2006.
[30] B. Ferry, B. Roozendaal, and J. McGaugh, "Role of norepinephrine in mediating stress hormone regulation of long-term memory storage: a critical involvement of the amygdala," Biol. Psychiatry, vol. 46, no. 9, pp. 1140-1152, 1999.
[31] E. R. Kandel, J. H. Schwartz, and T. M. Jessell, Principles of Neural Science, 4th ed. McGraw-Hill Medical, 2003.
[32] A. C. Roberts, "Primate Orbitofrontal Cortex and Adaptive Behavior," Trends Cogn. Sci., vol. 10, no. 2, pp. 83-90, 2006.
[33] M. L. Kringelbach, "The orbitofrontal cortex: linking reward to hedonic experience," Nat. Rev. Neurosci., vol. 6, pp. 691-702, 2005.
[34] A. G. Phillips, "The Brain and Emotion by Edmund T. Rolls," Trends Cogn. Sci., vol. 3, pp. 281-282, 1999.
[35] C. Cavada and W. Schultz, "The Mysterious Orbitofrontal Cortex. Foreword," Cereb. Cortex, vol. 10, no. 3, p. 205, 2000.
[36] C. A. Winstanley, D. E. H. Theobald, R. N. Cardinal, and T. W. Robbins, "Contrasting Roles of Basolateral Amygdala and Orbitofrontal Cortex in Impulsive Choice," J. Neurosci., vol. 24, no. 20, pp. 4718-4722, 2004.
[37] M. L. Kringelbach and E. T. Rolls, "The functional neuroanatomy of the human orbitofrontal cortex: evidence from neuroimaging and neuropsychology," Prog. Neurobiol., vol. 72, pp. 341-372, 2004.
[38] B. D. Damas and L. Custódio, "Emotion-Based Decision and Learning Using Associative Memory and Statistical Estimation," Informatica (Slovenia), vol. 27, no. 2, pp. 145-156, 2004.
[39] J. D. Velásquez, "When Robots Weep: Emotional Memories and Decision-Making," in Proc. Conf. on Artificial Intelligence, pp. 70-75, 1997.
[40] S. H. Zadeh, S. B. Shouraki, and R. Halavati, "Emotional behaviour: A resource management approach," Adaptive Behaviour, vol. 14, pp. 357-380, 2006.
[41] M. Maçãs and L. Custódio, "Multiple Emotion-Based Agents Using an Extension of DARE Architecture," Informatica (Slovenia), vol. 27, no. 2, pp. 185-196, 2004.
[42] E. Daryabeigi, G. R. A. Markadeh, and C. Lucas, "Emotional controller (BELBIC) for electric drives — A review," in Proc. Annual Conf. IEEE Industrial Electronics, pp. 2901-2907, 2010.
[43] N. Sheikholeslami, D. Shahmirzadi, E. Semsar, and C. Lucas, "Applying Brain Emotional Learning Algorithm for Multivariable Control of HVAC Systems," J. Intell. Fuzzy Syst., vol. 16, pp. 1-12, 2005.
[44] A. R. Mehrabian, C. Lucas, and J. Roshanian, "Aerospace Launch Vehicle Control: An Intelligent Adaptive Approach," Aerosp. Sci. Technol., vol. 10, pp. 149-155, 2006.
[45] R. M. Milasi, C. Lucas, and B. N. Araabi, "Intelligent Modeling and Control of Washing Machines Using LLNF Modeling and Modified BELBIC," in Proc. Int. Conf. Control and Automation, pp. 812-817, 2005.
[46] A. M. Yazdani, S. Buyamin, S. Mahmoudzadeh, Z. Ibrahim, and M. F. Rahmat, "Brain emotional learning based intelligent controller for stepper motor trajectory tracking," IJPS, vol. 7, no. 15, pp. 2364-2386, 2012.
[47] M. Fatourechi, C. Lucas, and A. K. Sedigh, "Emotional Learning as a New Tool for Development of Agent-based System," Informatica (Slovenia), vol. 27, no. 2, pp. 137-144, 2004.
[48] S. C. Gadanho and L. Custódio, "Learning Behavior-selection in a Multi-goal Robot Task," Informatica (Slovenia), vol. 27, no. 2, pp. 175-184, 2003.
[49] T. Kuremoto, T. Ohta, K. Kobayashi, and M. Obayashi, "A dynamic associative memory system by adopting amygdala model," J. AROB, vol. 13, pp. 478-482, 2009.
[50] T. Kuremoto, T. Ohta, K. Kobayashi, and M. Obayashi, "A functional model of limbic system of brain," in Proc. Int. Conf. Brain Informatics, pp. 135-146, 2009.
[51] D. Mirikitani and N. Nikolaev, "Recursive Bayesian recurrent neural networks for time-series modeling," IEEE Trans. Neural Netw., vol. 21, no. 2, pp. 262-274, Feb. 2010.
[52] M. M. R. Yousefi, M. Mirmomeni, and C. Lucas, "Input variables selection using mutual information for neuro fuzzy modeling with the application to time series forecasting," in Proc. Int. Joint Conf. Neural Netw., pp. 1121-1126, 2007.
[53] Y. Chen, B. Yang, and J. Dong, "Time-series prediction using a local linear wavelet neural network," Neurocomputing, vol. 69, no. 4-6, pp. 449-465, 2006.
[54] N. I. Sapankevych and R. Sankar, "Time series prediction using support vector machines: A survey," IEEE Comput. Intell. Mag., vol. 4, no. 2, pp. 24-38, May 2009.
[55] D. Li, M. Han, and J. Wang, "Chaotic time series prediction based on a novel robust echo state network," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 5, pp. 787-799, May 2012.
[56] K. Hechenbichler, "Weighted k-Nearest-Neighbor Techniques and Ordinal Classification," Discussion Paper 399, SFB 386, 2004.
[57] G. Shakhnarovich, T. Darrell, and P. Indyk, Nearest-Neighbor Methods in Learning and Vision: Theory and Practice. Cambridge, MA: MIT Press, March 2006.
[58] A. Gholipour, C. Lucas, B. N. Araabi, M. Mirmomeni, and M. Shafiee, "Extracting the main patterns of natural time series for long-term neurofuzzy prediction," Neural Computing & Applications, vol. 16, no. 4-5, pp. 383-393, 2007.
[59] M. Ardalani-Farsa and S. Zolfaghari, "Chaotic time series prediction with residual analysis method using hybrid Elman–NARX neural networks," Neurocomputing, vol. 73, no. 13-15, pp. 2540-2553, 2010.
[60] M. Qian-Li, Z. Qi-lun, P. Hong, Z. Tan-Wei, and X. Li-Qiang, "Chaotic Time Series Prediction Based on Evolving Recurrent Neural Networks," in Proc. Int. Conf. Machine Learning and Cybernetics (ICMLC 2007), vol. 6, pp. 3496-3500, 2007.
[61] D. Martinez-Rego, O. Fontenla-Romero, and A. Alonso-Betanzos, "A method for time series prediction using a combination of linear models," in Proc. European Symposium on Artificial Neural Networks – Advances in Computational Intelligence and Learning.
[62] R. Chandra and M. Zhang, "Cooperative coevolution of Elman recurrent neural networks for chaotic time series prediction," Neurocomputing, vol. 86, pp. 116-123, 2012.
[63] G. Inoussa, H. Peng, and J. Wu, "Nonlinear time series modeling and prediction using functional weights wavelet neural network-based state-dependent AR model," Neurocomputing, vol. 86, pp. 59-74, 2012.
[64] A. Miranian and M. Abdollahzade, "Developing a Local Least-Squares Support Vector Machines-Based Neuro-Fuzzy Model for Nonlinear and Chaotic Time Series Prediction," IEEE Trans. Neural Netw., vol. 24, no. 2, pp. 207-218, 2013.
[65] D. T. Mirikitani and N. Nikolaev, "Recursive Bayesian Recurrent Neural Networks for Time-Series Modeling," IEEE Trans. Neural Netw., vol. 21, no. 2, pp. 262-274, 2010.