Revisited: Machine Intelligence in Heterogeneous Multi Agent Systems

Kaustav Jyoti Borah1, Department of Aerospace Engineering, Ryerson University, Canada email: [email protected]

Rajashree Talukdar2, B. Tech, Department of Computer Science and Engineering, SMIT, India.

Abstract: Reinforcement learning algorithms have been around for decades and have been employed to solve various sequential decision-making problems. When dealing with complex environments, these algorithms face several key challenges. The recent development of deep learning has enabled reinforcement learning algorithms to compute optimal policies for sophisticated and capable agents. In this paper we explore several recently applied algorithms based on the interaction of multiple agents and their components. This survey provides insights into deep reinforcement learning approaches that can lead to the future development of more robust methods for solving many real-world scenarios.

1. Introduction

An agent is an entity that perceives its environment through sensors and acts upon that environment through effectors [1]. Multi-agent systems (MAS) are a technology that has been widely used in many autonomous and intelligent system simulators to perform smart modelling tasks. The field aims to provide both principles for the construction of complex systems involving multiple agents and mechanisms for the coordination of independent agents' behaviors. While there is no generally accepted definition of “agent” in AI [2], in a multi-agent system all agents are independent of each other in the sense that each has its own access to the environment. They therefore need to adapt to new circumstances, so each agent should incorporate a learning algorithm to learn about and explore the environment. In addition, because the agents in a multi-agent system interact, they need some form of communication between them to behave as a group. Agents have gained great importance in academic and commercial applications with the help of internet technology.

Based on their heterogeneity, multi-agent systems can be further subdivided into two categories: 1) homogeneous and 2) heterogeneous. Homogeneous MAS consist of agents that all have the same characteristics and functionality, while heterogeneous MAS consist of agents with diverse features; that is, heterogeneous systems are composed of many sub-agents or sub-components that are not uniform throughout. Such a system is a distributed system containing many kinds of hardware and software working together in a cooperative fashion to solve a critical problem. Examples of heterogeneous systems include smart grids, computer networks, stock markets, and power distribution systems in aircraft.

A good property of multi-agent learning is that agent performance improves continuously. Multi-agent learning is fundamentally a matter of agents learning through interactions with other agents and with the environment. Many algorithms developed in machine learning can be transferred to settings with multiple, interdependent, interacting learning agents, although they may require modification to account for the other agents in the environment [3, 4]. In this paper, we focus on different machine learning techniques. The paper is organised as follows: Section 3 describes the differences between single-agent and multi-agent systems, Section 4 describes the environments in multi-agent systems, Section 5 describes a general heterogeneous system architecture, and Section 6 covers machine learning techniques, followed by Section 7 on applications and future work.

2. Ongoing Research

This research, carried out during the first author's PhD studies, has so far focused primarily on multi-agent systems and machine learning.

3. Single agent vs multi agent systems

1) When there is only one agent defined in the system environment, it is known as a single-agent system (SAS). A single-agent system interacts only with the environment.

2) SAS is a totally centralized system.

3) Centralized systems have one agent that makes all decisions and takes all actions, while the others act as remote slaves.

4) "single-agent system'' can be called as complex, centralized system in a domain which also allows for a multi-agent approach.

5) A single-agent system might still have multiple entities - several actuators, or even several robots. However, if each entity sends its perceptions to and receives its actions from a single central processor, then there is only a single agent: the central process.

6) Multi-agent systems have more than one agent; the agents interact with each other and with the environment to produce better outcomes.

7) Multi-agent systems differ from single-agent systems most significantly in that the environment's dynamics can be determined by other agents. In addition to the uncertainty that may be inherent in the domain, other agents intentionally affect the environment in unpredictable ways. Thus, all multi-agent systems can be viewed as having dynamic environments.

8) Multi-agent systems can manifest self-organization as well as self-direction and other control paradigms and related complex behaviors even when the individual strategies of all their agents are simple.

9) In a centralized agent system, all models reside within a single agent; this completes the contrast between the single-agent and multi-agent approaches.

10) Multi-agent systems are implemented in computer simulations that step the system through discrete time steps; the agents communicate using a weighted request matrix.

4. Multi Agent System Environments

Agents can operate in many different types of environments. The main categories are summarised below, in mutually exclusive pairs, based on the definitions provided by [30], [31]; an illustrative sketch follows the list.

1. Static environments: the environment in which the agent operates reacts only based on the agent's action.
2. Dynamic environments: the environment can change without input from the agent, to potentially unknown states.
3. Fully observable environments: the full state of the environment is available to the agent at each point in time.
4. Partially observable environments: parts of the state of the environment are unobservable to the agent.
5. Deterministic environments: an action taken in the current state fully determines the next state of the environment.
6. Stochastic environments: an action taken in the current state can lead to different states.
7. Stationary environments: a stationary environment does not evolve over time and has a predefined set of states.
8. Non-stationary environments: a non-stationary environment evolves over time and can lead an agent to previously unencountered states.
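As a purely illustrative aid (not part of the definitions in [30], [31]), the four pairs can be read as independent axes of an environment description. The sketch below, with invented names, records one choice per axis for a typical multi-agent setting.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical encoding of the four mutually exclusive pairs listed above.
class Dynamics(Enum):
    STATIC = "static"
    DYNAMIC = "dynamic"

class Observability(Enum):
    FULLY_OBSERVABLE = "fully observable"
    PARTIALLY_OBSERVABLE = "partially observable"

class Transitions(Enum):
    DETERMINISTIC = "deterministic"
    STOCHASTIC = "stochastic"

class Stationarity(Enum):
    STATIONARY = "stationary"
    NON_STATIONARY = "non-stationary"

@dataclass
class EnvironmentProfile:
    """One value per axis characterises the environment an agent faces."""
    dynamics: Dynamics
    observability: Observability
    transitions: Transitions
    stationarity: Stationarity

# A typical multi-agent setting: other agents change the world between this
# agent's actions, and their evolving policies make the environment non-stationary.
multi_agent_env = EnvironmentProfile(
    dynamics=Dynamics.DYNAMIC,
    observability=Observability.PARTIALLY_OBSERVABLE,
    transitions=Transitions.STOCHASTIC,
    stationarity=Stationarity.NON_STATIONARY,
)
```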

5. General Heterogeneous Multi agent architecture

Figure 1: General Heterogeneous Architecture

Figure 1 shows a general multi-agent scenario for heterogeneous communicating agents. These agents communicate with each other to fulfil a global goal. In the figure, we have m groups of heterogeneous agents, where A = {A1, A2, …, Am} is the set of agent groups. Inside group 1, we assume that all agents are homogeneous. Each blue arrow indicates a communication link between agents. A minimal sketch of this structure is given below.
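The following is a hypothetical Python sketch of this structure, not an implementation from the paper; the agent names, group labels, and capability strings are invented for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Agent:
    """One agent; the group label identifies the homogeneous subgroup it belongs to."""
    name: str
    group: str                                     # e.g. "A1", "A2", ..., "Am"
    capabilities: List[str] = field(default_factory=list)

@dataclass
class HeterogeneousMAS:
    """Groups A = {A1, ..., Am} of agents plus explicit communication links."""
    agents: Dict[str, Agent] = field(default_factory=dict)
    links: List[Tuple[str, str]] = field(default_factory=list)

    def add_agent(self, agent: Agent) -> None:
        self.agents[agent.name] = agent

    def connect(self, a: str, b: str) -> None:
        """Add a bidirectional communication link (a blue arrow in Figure 1)."""
        self.links.append((a, b))

    def neighbours(self, name: str) -> List[str]:
        """Agents this agent can exchange messages with."""
        return [y if x == name else x for (x, y) in self.links if name in (x, y)]

# Two heterogeneous groups: A1 (hypothetical ground robots) and A2 (hypothetical aerial robots).
mas = HeterogeneousMAS()
mas.add_agent(Agent("rover_1", "A1", ["drive", "sense"]))
mas.add_agent(Agent("rover_2", "A1", ["drive", "sense"]))
mas.add_agent(Agent("drone_1", "A2", ["fly", "camera"]))
mas.connect("rover_1", "rover_2")   # intra-group link
mas.connect("rover_1", "drone_1")   # inter-group link
print(mas.neighbours("rover_1"))    # ['rover_2', 'drone_1']
```

Such a structure only records who can talk to whom; the learning techniques of the next section determine what the agents do and communicate over these links.

6. Machine Learning Techniques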

Multi-agent systems are a well-known subfield of Artificial Intelligence and have a significant impact on how very complex problems are framed and solved. Since there is no specific definition of an agent in multi-agent systems, as stated earlier we consider an agent to be an entity with goals, actions, and domain knowledge situated in an environment; how it acts constitutes its “behaviour”. Although the ability to coordinate the behaviours of autonomous agents is relatively new, the field is advancing quickly by building upon pre-existing work in the field of Distributed Artificial Intelligence (DAI) [2].

From a machine learning perspective, approaches are distinguished by the kind of reward or return the critic can provide as feedback to the learner. There are three main techniques [5]: 1) supervised learning, 2) unsupervised learning, and 3) reinforcement learning. In supervised learning, the critic provides the correct output of the system, whereas in unsupervised learning no feedback is provided at all. In reinforcement learning, the critic evaluates the learner's output and provides a reward, as in actor-critic techniques. In all three cases, the learning feedback is assumed to be provided by the system environment or the agents themselves.

Multi-agent reinforcement learning algorithms have attracted attention because of their generality and robustness. These techniques have been applied in both stationary and non-stationary environments.

6.1 Supervised Learning

Supervised learning is a machine learning technique that learns a function mapping an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples, where each example is a pair of an input and a desired output. A supervised learning algorithm analyzes the training data and produces an inferred function that can be used to map new examples; in the optimal scenario, the algorithm correctly determines the class labels of unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. Considerable complexity arises when multiple agents interact with each other, so supervised learning algorithms cannot be applied directly, because they typically assume a critic that can provide the agents with the “correct” behavior [5] for a given situation. Sniezynski [9] used a supervised learning method for the Fish Banks game. Garland and Alterman [10] use learned coordinated procedures to enhance coordination in an environment of heterogeneous agents. Williams [11] describes inductive learning methods for learning individual agents' ontologies in the context of the semantic web. Gehrke and Wojtusiak [12] used a rule induction technique for route planning. Airiau et al. [13] add learning capabilities to the BDI model, in which decision tree learning is used to support plan applicability testing. Table 1 compares each of the above supervised learners with respect to these learning aspects [5]; a minimal illustrative sketch of the supervised setting follows the table.

Table 1: Supervised learners compared to learning aspects

Researchers               | Centralization | Interaction | Learning   | Goal
Sniezynski [9]            | Centralized    | No          | 2-Agents   | Selfish
Garland and Alterman [10] | De-Centralized | Yes         | All-Agents | Cooperative
Williams [11]             | De-Centralized | Yes         | All-Agents | Cooperative
Gehrke and Wojtusiak [12] | De-Centralized | Yes         | All-Agents | Cooperative
Airiau et al. [13]        | De-Centralized | Yes         | All-Agents | Cooperative
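To make the supervised setting concrete, the toy sketch below trains a decision tree, echoing the use of decision tree learning in [13], on labelled state-action pairs supplied by a critic. The feature encoding, labels, and data are invented for illustration and are not taken from any of the cited works.

```python
from sklearn.tree import DecisionTreeClassifier

# Each labelled example: (observed state features) -> the "correct" action given by the critic.
X_train = [
    [0, 1, 3],   # hypothetical features: [obstacle_ahead, battery_ok, distance_to_goal]
    [1, 1, 3],
    [0, 0, 1],
    [1, 0, 5],
]
y_train = ["advance", "turn", "recharge", "turn"]

# The learner infers a function from states to actions ...
learner = DecisionTreeClassifier(max_depth=3, random_state=0)
learner.fit(X_train, y_train)

# ... and is expected to generalize "reasonably" to an unseen state.
print(learner.predict([[0, 1, 4]]))   # e.g. ['advance']
```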

6.2 Unsupervised Learning

In unsupervised learning, no explicit concepts or labels are given. Some reviews of unsupervised learning algorithms suggest that they are not well suited to learning in multi-agent systems: unsupervised learning has no explicit target, whereas the main aim in a multi-agent system is to improve overall system performance. Hence, there has been little notable research in this area. However, researchers have tried to use unsupervised learning algorithms to solve subsidiary issues that help an agent through its learning [5, 14, 15].

6.3 Reinforcement Learning

Reinforcement learning is a method that matches the agent paradigm exactly, unlike the other two methods, supervised and unsupervised learning, where the learner must be supplied with suitable data. Reinforcement learning allows an autonomous agent that has no prior knowledge of the system's behaviour and/or the environment to progressively improve its performance based on the rewards received as the learning task is performed [15]. Thus, the large majority of work in this area uses reinforcement learning algorithms. As stated in [5], the literature on reinforcement learning algorithms for multi-agent systems can be divided into two subsets: 1) methods based on estimating value functions; and 2) stochastic search-based methods, in which agents directly learn behaviors without appealing to value functions, largely by evolutionary means. For methods based on estimating value functions, there are both single-agent and multi-agent algorithms [5]. The environment in reinforcement learning is typically modelled as a Markov Decision Process (MDP), and most reinforcement learning algorithms use a dynamic programming approach. There are many reinforcement learning based algorithms; Table 2 compares several of them [5].

Table 2: Comparison of various reinforcement learning algorithms

Researchers              | Centralization | Interaction b/w agents | Learning   | Final Goal
Bowling and Veloso [22]  | D              | Yes                    | All-Agents | Cooperative and Competitive
Barto et al. [19]        | C              | No                     | All-Agents | Selfish
Sutton [20]              | C              | No                     | All-Agents | Selfish
Moore and Atkeson [21]   | C              | No                     | All-Agents | Selfish
Greenwald and Hall [23]  | D              | Yes                    | All-Agents | Competitive
Kononen [24]             | D              | Yes                    | One-Agent  | Cooperative
Lagoudakis and Parr [25] | C              | No                     | One-Agent  | Selfish
McGlohon and Sen [26]    | D              | Yes                    | All-Agents | Cooperative
Qi and Sun [27]          | D              | Yes                    | All-Agents | Cooperative
Puterman [17]            | C              | No                     | All-Agents | Selfish
Watkins and Dayan [18]   | C              | No                     | All-Agents | Selfish
Bertsekas [16]           | C              | No                     | All-Agents | Selfish

(C = centralized, D = de-centralized.)

Table 3: Standard algorithms for Reinforcement learning

Algorithm   | Description                       | Model      | Policy | Action Space | State Space | Operator
Monte Carlo | Every visit to Monte Carlo        | Model-Free | Off    | Discrete     | Discrete    | Sample-Means
Q-Learning  | State-action-reward-state         | Model-Free | Off    | Discrete     | Discrete    | Q-Value
SARSA       | State-action-reward-state-action  | Model-Free | On     | Discrete     | Discrete    | Q-Value
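Before examining the individual algorithms, the sketch below defines a toy environment, a five-cell corridor with slightly stochastic moves, that the later sketches in this section reuse. It is a hypothetical illustration of the tabular, discrete MDP setting summarised in Table 3, not an environment taken from the cited works.

```python
import random

STATES = [0, 1, 2, 3, 4]        # five corridor cells
ACTIONS = ["left", "right"]
TERMINAL = 4                    # reaching the right end ends the episode with reward +1

def step(state, action, slip=0.1):
    """Return (next_state, reward); with probability `slip` the move is reversed."""
    move = 1 if action == "right" else -1
    if random.random() < slip:  # stochastic transitions, cf. "stochastic environments" in Section 4
        move = -move
    next_state = min(max(state + move, 0), TERMINAL)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward
```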

6.3.1 Monte-Carlo Method

The Monte Carlo (MC) method [32] determines the value function by repeatedly generating episodes and recording the average return at each state or each state-action pair. The state value function is therefore calculated as follows:

$$V_{\pi}^{MC}(s) = \lim_{i \to +\infty} E\left[ r^{i}(s_t) \mid s_t = s, \pi \right] \qquad (1)$$

where $r^{i}(s_t)$ denotes the observed return at state $s_t$ in the $i$-th episode. Similarly, we have the value function of a state-action pair:

$$Q_{\pi}^{MC}(s,a) = \lim_{i \to +\infty} E\left[ r^{i}(s_t, a_t) \mid s_t = s, a_t = a, \pi \right] \qquad (2)$$

Knowledge of the transition probabilities is not required for this method, i.e., the MC method is model-free. However, this approach makes two essential assumptions in order to ensure convergence [32]:

1) the number of episodes is large and

2) every state and every action is visited a significant number of times. To make this "exploration" possible, an ε-greedy strategy is used in the policy improvement step instead of a purely greedy one:

$$RS: \pi \rightarrow \pi' = \psi^{\epsilon}(s) = \begin{cases} 1 - \epsilon + \dfrac{\epsilon}{|\Delta_{\pi}(s)|}, & a_i = a_j \;\wedge\; a_j = \arg\max_{a_k} Q_{\pi}(s, a_k) \\[4pt] \dfrac{\epsilon}{|\Delta_{\pi}(s)|}, & \forall a_i \in \Delta_{\pi}(s) \;\wedge\; a_i \neq a_j \end{cases} \qquad (3)$$

Where |  π (s)| denotes number of candidate action taken in state s and 0<ϵ<1.Generally MC algorithm are divided into two groups: on-policy and off-policy. In on-policy method we use policy π for both exploration and evaluation purpose. Therefore, the policy π must be stochastic or soft. But off-policy uses different policy π’≠ π in order to generate the episodes and hence π can be deterministic. Off-policy method is desirable due to its simplicity, but on-policy method is more stable when working with continuous state space problems and when using together with function approximator such as neural networks.

6.3.2 Temporal Difference Method

Similar to MC, the temporal difference (TD) method [32] also learns from experience, i.e., it is model-free. This method does not wait until the episode is over to make an update; it updates at every step within the episode by leveraging the one-step Bellman equation, hence possibly providing faster convergence:

$$U_1: V^{i}(s_t) \leftarrow \alpha V^{i-1}(s_t) + (1 - \alpha)\left(r_{t+1} + \gamma V^{i-1}(s_{t+1})\right) \qquad (4)$$

where α is a step-size parameter and 0 < α < 1. The previous estimate $V^{i-1}$ is used to update the current one $V^{i}$, which is known as bootstrapping; bootstrapping methods learn faster than non-bootstrapping methods in most cases. TD learning is divided into two major categories: on-policy TD learning (Sarsa) and off-policy TD learning (Q-learning). The Sarsa algorithm estimates the value function of state-action pairs based on:

$$U_2: Q^{i}(s_t, a_t) \leftarrow \alpha Q^{i-1}(s_t, a_t) + (1 - \alpha)\left(r_{t+1} + \gamma Q^{i-1}(s_{t+1}, a_{t+1})\right) \qquad (5)$$

On the other hand, Q-learning uses the one-step optimality Bellman equation to perform the update; that is, Q-learning directly approximates the value function of the optimal policy:

$$U_3: Q^{i}(s_t, a_t) \leftarrow \alpha Q^{i-1}(s_t, a_t) + (1 - \alpha)\left(r_{t+1} + \gamma \max_{a_{t+1}^{j}} Q^{i-1}(s_{t+1}, a_{t+1}^{j})\right) \qquad (6)$$

Note that the max operator in the update rule substitutes for a deterministic policy, which explains why Q-learning is off-policy.
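As an illustration only, the two update rules can be written side by side. The sketch below follows the convention of Eqs. (5) and (6), in which α weights the previous estimate and (1 − α) weights the new TD target (the more common textbook convention swaps these roles); Q is assumed to be a dict keyed by (state, action) pairs.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.9, gamma=0.9):
    """On-policy TD update, cf. Eq. (5): the target uses the action actually chosen in s_next."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = alpha * Q.get((s, a), 0.0) + (1 - alpha) * target

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.9, gamma=0.9):
    """Off-policy TD update, cf. Eq. (6): the target maximises over the possible next actions."""
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = alpha * Q.get((s, a), 0.0) + (1 - alpha) * target
```

The only difference between the two rules is how the target is formed, which is precisely why Sarsa is on-policy and Q-learning is off-policy.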

Both of these methods use a tabular structure to store the value function of each state or each state-action pair. This becomes insufficient, owing to memory limitations, when solving complicated problems with a large number of states. The actor-critic (AC) method was therefore introduced to overcome this limitation. AC includes two memory structures for an agent [32]: the actor structure is used to select a suitable action according to the observed state, which is then passed to the critic structure for evaluation. The critic structure uses the following TD error to decide the future tendency of the selected action:

$$\delta(a_t) = \beta\left(r_{t+1} + \gamma V(s_{t+1})\right) - (1 - \beta) V(s_t) \qquad (7)$$

6.3.3 Q-learning

Q-learning is a model-free reinforcement learning approach for learning in agent systems and/or subsystems. The ultimate goal of Q-learning is to compute a mapping from states to actions, the so-called policy, that yields maximum utility for the agent. After executing an action the agent receives a reward; in simple terms, the agent learns by exploring the environment, that is, by experimenting with different actions and observing the resulting rewards. In a finite Markov decision process, Q-learning can identify an optimal action-selection policy given infinite exploration time and a partly random policy. The result of Q-learning, i.e., the policy, may be seen as a table that assigns each state–action pair (s, a) a numerical value, which is an estimate of the (possibly long-term) reward to be received when executing a in s. After receiving a reward, an agent updates the numerical value of the state–action pair based on the reward and on the estimated best reward to be gained in the new state. Thus, with time, the agent is able to improve its estimates of the rewards to be received for all state–action pairs. Due to space restrictions, we do not present details of the Q-learning algorithm but rather refer the reader to [7] or [8].

Q-learning has several advantages: efficiency, guaranteed convergence toward the optimum, and natural applicability to agent systems because of the coupling of learning and exploration. However, it also has drawbacks: 1) designing a suitable numerical reward function can be a nontrivial task; 2) convergence to the optimum can be slow; and 3) no explanation is given for the action preferences in the learned result. A minimal end-to-end sketch of tabular Q-learning is given below.
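The sketch below runs tabular Q-learning on the toy corridor environment introduced after Table 3. It is written in the common form Q ← Q + α(target − Q), which is algebraically equivalent to Eq. (6) with the roles of α and 1 − α swapped; the environment, hyperparameters, and printed policy are illustrative assumptions, not results from the cited works.

```python
import random
from collections import defaultdict

def train_q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2):
    Q = defaultdict(float)                       # the Q-table: (state, action) -> estimated value
    for _ in range(episodes):
        state = 0
        while state != TERMINAL:
            # epsilon-greedy selection couples learning with exploration
            if random.random() < epsilon:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward = step(state, action)
            # update toward the reward plus the discounted best value of the next state
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

Q_table = train_q_learning()
greedy_policy = {s: max(ACTIONS, key=lambda a: Q_table[(s, a)]) for s in STATES if s != TERMINAL}
print(greedy_policy)    # expected to prefer "right" in every non-terminal cell
```

Extracting the greedy policy from the learned table gives exactly the state-to-action mapping described above.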

7. Applications and Future work

One major application in this area is the development of algorithms that perform better in airport ground handling management systems. The ground handling fleet management problem at airports [28] is considered with the aim of improving aircraft service at arrival and departure while taking the operational cost of the ground service fleets into account. The complexity of the problem, as well as operational considerations, leads to an on-line decentralized management structure in which the criticality of each aircraft's demand for service is evaluated using a particular fuzzy formalism. After detailing the assumed collaborative scheme between ground handling fleet managers, airlines, and airport authorities, a machine learning technique can be applied to solve this multi-agent problem.

Another example is stock market prediction. A requirement analysis for portfolio management in stock trading [29] establishes the viability and theoretical foundation of a stock trading system. The overall portfolio management tasks include eliciting user profiles, collecting information on the user's initial portfolio position, monitoring the environment on behalf of the user, and making decision suggestions to meet the user's investment goals. Based on the requirement analysis, Davis et al. [29] present a framework for a Multi-Agent System for Stock Trading (MASST). The key issues it addresses include gathering and integrating diverse information sources with collaborating agents and providing decision-making support for investors in the stock market. The authors identify the candidate agents and the tasks that the agents perform, and describe agent communication and the exchange of information and knowledge between agents.

8. Conclusion

We presented several machine learning techniques for agents and multi-agent systems. Furthermore, we presented two examples of multi-agent-based systems that are being developed at Ryerson University, Toronto, Canada. Machine learning methods applied to multi-agent systems still require more attention to prove their practicality; ML for multi-agent systems is a relatively young research area, and there are many issues that need further research, some of which have already been mentioned.

References

[1] Shoham, Y., and Leyton-Brown, K. Multiagent Systems Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press. 2009.

[2] Peter Stone and Manuela Veloso. “Multiagent Systems: A Survey from a Machine Learning Perspective”, Autonomous Robots, volume 8, number 3, 2000.

[3] Yang, Z., and Shi, X. “An agent-based immune evolutionary learning algorithm and its application”, In Proceedings of the Intelligent Control and Automation (WCICA). 5008-5013. 2014.

[4] Qu, S., Jian, R., Chu, T., Wang, J., and Tan, T. “Computational Reasoning and Learning for Smart Manufacturing Under Realistic Conditions”, In Proceedings of the Behavior, Economic and Social (BESC) Conferences, 1-8. 2014.

[5] Khaled M. Khalil, Mohamed Abdelaziz and Abdel-Badeeh M. Salem, “Machine Learning Algorithms for Multi-Agent Systems”, 2015.

[6] Yoad Lewenberg, 2017 “Machine Learning Techniques for Multiagent Systems” Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17) pp. 5185-5186, 2017.

[7] Mitchell, T. Machine Learning. New York: McGraw-Hill, 1997.

[8] Kaebling, L.P., Littman, M.L., & Moore, A.W. “Reinforcement learning: a survey”, Journal of Artificial Intelligence Research, 4. 1996.

[9] Sniezynski, B. “Supervised Rule Learning and Reinforcement Learning in A Multi-Agent System for the Fish Banks Game”, Theory and Novel Applications of Machine Learning. 2009.

[10] Garland, A., and Alterman, A. “Autonomous agents that learn to better coordinate”, Autonomous Agents and Multi-Agent Systems 8, 267-301. 2004.

[11] Williams, A. “Learning to share meaning in a multi-agent system”, Autonomous Agents and Multi-Agent Systems 8, 165-193. 2004.

[12] Gehrke, J. D., and Wojtusiak, J. “Traffic Prediction for Agent Route Planning”, In Proceedings of the International Conference on Computational Science, 692-701. 2008.

[13] Airiau, S., Padham, L., Sardina, S., and Sen, S. “Incorporating Learning in BDI Agents”, In Adaptive Learning Agents and Multi-Agent Systems Workshop (ALAMAS+ALAg-08), 2008.

[14] Kiselev, A. A self-organizing multi-agent system for online unsupervised learning in complex dynamic environments. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, 1808-1809, 2008.

[15] Sadeghlou, M., Akbarzadeh-T, M. R., and Naghibi-S, M. B. “Dynamic agent-based reward shaping for multi-agent systems”, In Proceedings of the Iraniance Conference on Intelligent Systems (ICIS), 1-6, 2014.

[16] Bertsekas, D. P. Dynamic Programming and Optimal control, 2nd Ed., Athena Scientific, 2001.

[17] Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1st Ed., Wiley, 2008.

[18] Watkins, C. J. C. H., and Dayan, P. Q-learning. Machine Learning 8, 279-292, 1992.

[19] Barto, A. G., Sutton, R. S., and Anderson, C. W. “Neuronlike adaptive elements that can solve difficult learning control problems”, IEEE Transactions on Systems, Man, and Cybernetics 5, 843-846, 1983.

[20] Sutton, R. S. “Integrated architectures for learning, planning, and reacting based on approximating dynamic programming”, In Proceedings of the Seventh International Conference on Machine Learning (ICML-90), Austin, US, 216-224, 1990.

[21] Moore, A. W., and Atkeson, C. G. “Prioritized sweeping: Reinforcement learning with less data and less time”, Machine Learning 13, 103-130, 1993.

[22] Bowling, M., and Veloso, M. “Multiagent learning using a variable learning rate”, Artificial Intelligence 136, 215-250, 2002.

[23] Greenwald, A., and Hall, K. “Correlated-Q learning”, In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), Washington, US, 242-249, 2003.

[24] Kononen, V. “Gradient descent for symmetric and asymmetric multiagent reinforcement learning”, Web Intelligence and Agent Systems 3, 17-30, 2005.

[25] Lagoudakis, M. G., and Parr, R. “Least-squares policy iteration”, Machine Learning Research 4, 1107-1149, 2003.

[26] McGlohon, M., and Sen, S. “Learning to Cooperate in Multi-Agent Systems by Combining Q-Learning and Evolutionary Strategy”, In Proceedings of the World Conference on Lateral Computing, 2004.

[27] Qi, D., and Sun, R. “A multi-agent system integrating reinforcement learning, bidding and genetic algorithms”, Web Intelligence and Agent Systems 1, 187-202, 2003.

[28] Salma Fitouri Trabelsi, Carlos Alberto Nunes Cosenza, Luis Gustavo Zelaya Cruz, Felix Mora-Camino, “An operational approach for ground handling management at airports with imperfect information”, 19th International Conference on Industrial Engineering and Operations Management, Valladolid, Spain, July 2013.

[29] Darryl Davis, Yuan Luo and Kecheng Liu, “A Multi-Agent Framework for Stock Trading” School of Computing, Staffordshire University, Stafford ST18 0DG, UK, Department of Computer Science, University of Hull, HU6 7RX, UK 2000/8

[30] Andrei Marinescu, “Prediction-Based Multi-Agent Reinforcement Learning for Inherently Non-Stationary Environments”, PhD thesis, Computer Science, University of Dublin, Trinity College, 2016.

[31] Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach. Prentice Hall, 2003.

[32] Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi, “Deep Reinforcement Learning for Multi-Agent Systems: A Review of Challenges, Solutions and Applications”, arXiv:1812.11794v2 [cs.LG], 6 Feb 2019.