The Reinforcement Learning Problem (Markov Decision Processes, or MDPs)

Total pages: 16

File type: PDF, size: 1020 KB

Chapter 3: The Reinforcement Learning Problem (Markov Decision Processes, or MDPs)

Objectives of this chapter:
❐ present Markov decision processes—an idealized form of the AI problem for which we have precise theoretical results
❐ introduce key components of the mathematics: value functions and Bellman equations

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

The Agent–Environment Interface

[Figure 3.1: The agent–environment interaction in reinforcement learning. The agent receives state S_t and reward R_t from the environment and emits action A_t; the environment responds with reward R_{t+1} and next state S_{t+1}.]

The agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, .... At each time step t, the agent receives some representation of the environment's state, S_t ∈ S, where S is the set of possible states, and on that basis selects an action, A_t ∈ A(S_t), where A(S_t) is the set of actions available in state S_t. One time step later, in part as a consequence of its action, the agent receives a numerical reward, R_{t+1} ∈ R ⊂ ℝ, where R is the set of possible rewards, and finds itself in a new state, S_{t+1}. Figure 3.1 diagrams this interaction. (We restrict attention to discrete time to keep things as simple as possible, even though many of the ideas can be extended to the continuous-time case; see, e.g., Bertsekas and Tsitsiklis, 1996; Werbos, 1992; Doya, 1996.)

In summary: the agent observes the state at step t, S_t ∈ S; produces an action at step t, A_t ∈ A(S_t); and gets the resulting reward, R_{t+1} ∈ R, and the resulting next state, S_{t+1} ∈ S.

At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted π_t, where π_t(a|s) is the probability that A_t = a if S_t = s. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.

This framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision-making and acting. The actions can be low-level controls, such as the voltages applied to the motors of a robot arm, or high-level decisions, such as whether or not to have lunch or to go to graduate school. Similarly, the states can take a wide variety of forms, from low-level sensor readings to high-level, abstract descriptions.

Summary of Notation

Capital letters are used for random variables and major algorithm variables. Lower case letters are used for the values of random variables and for scalar functions. Quantities that are required to be real-valued vectors are written in bold and in lower case (even if random variables).

s: a state; a: an action
S: set of all nonterminal states; S+: set of all states, including the terminal state
A(s): set of actions possible in state s; R: set of all possible rewards
t: discrete time step; T: final time step of an episode
S_t: state at t; A_t: action at t; R_t: reward at t, dependent, like S_t, on A_{t-1} and S_{t-1}
G_t: return (cumulative discounted reward) following t
G_t^(n): n-step return (Section 7.1)
G_t^λ: λ-return (Section 7.2), G_t^λ = (1 - λ) Σ_{n≥1} λ^(n-1) G_t^(n)
π: policy, decision-making rule; π(s): action taken in state s under deterministic policy π; π(a|s): probability of taking action a in state s under stochastic policy π
p(s'|s, a): probability of transition from state s to state s' under action a
r(s, a, s'): expected immediate reward on transition from s to s' under action a
v_π(s): value of state s under policy π (expected return)
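To make the interface concrete, here is a minimal Python sketch of the interaction loop just described; the `env.reset()`/`env.step()` methods and the `policy` callable are illustrative assumptions (loosely Gym-style), not part of Sutton and Barto's text.

```python
def run_episode(env, policy, max_steps=1000):
    """One episode of the agent-environment loop: observe S_t, act A_t, receive R_{t+1} and S_{t+1}."""
    state = env.reset()                              # S_0
    episode_return = 0.0
    for t in range(max_steps):
        action = policy(state)                       # A_t sampled from pi_t(. | S_t)
        next_state, reward, done = env.step(action)  # environment emits R_{t+1} and S_{t+1}
        episode_return += reward                     # undiscounted return, for illustration
        state = next_state
        if done:                                     # terminal state reached
            break
    return episode_return
```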
Recommended publications
  • Survey on Reinforcement Learning for Language Processing
    Survey on Reinforcement Learning for Language Processing. Víctor Uc-Cetina (Universidad Autónoma de Yucatán), Nicolás Navarro-Guerrero (Aarhus University), Anabel Martin-Gonzalez (Universidad Autónoma de Yucatán), Cornelius Weber and Stefan Wermter (Universität Hamburg). arXiv:2104.05565 [cs.CL], February 2021.
    Abstract: In recent years some researchers have explored the use of reinforcement learning (RL) algorithms as key components in the solution of various natural language processing tasks. For instance, some of these algorithms leveraging deep neural learning have found their way into conversational systems. This paper reviews the state of the art of RL methods for their possible use in different problems of natural language processing, focusing primarily on conversational systems, mainly due to their growing relevance. We provide detailed descriptions of the problems as well as discussions of why RL is well-suited to solve them. Also, we analyze the advantages and limitations of these methods. Finally, we elaborate on promising research directions in natural language processing that might benefit from reinforcement learning. Keywords: reinforcement learning, natural language processing, conversational systems, parsing, translation, text generation.
    1 Introduction. Machine learning algorithms have been very successful at solving problems in the natural language processing (NLP) domain for many years, especially supervised and unsupervised methods. However, this is not the case with reinforcement learning (RL), which is somewhat surprising since in other domains reinforcement learning methods have experienced an increased level of success with some impressive results, for instance in board games such as AlphaGo Zero [106].
  • Probabilistic Goal Markov Decision Processes∗
    Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence. Probabilistic Goal Markov Decision Processes. Huan Xu (Department of Mechanical Engineering, National University of Singapore, Singapore) and Shie Mannor (Department of Electrical Engineering, Technion, Israel).
    Abstract: The Markov decision process model is a powerful tool in planning tasks and sequential decision making problems. The randomness of state transitions and rewards implies that the performance of a policy is often stochastic. In contrast to the standard approach that studies the expected performance, we consider the policy that maximizes the probability of achieving a pre-determined target performance, a criterion we term probabilistic goal Markov decision processes. We show that this problem is NP-hard, but can be solved using a pseudo-polynomial algorithm. We further consider a variant dubbed "chance-constraint Markov decision problems," that treats the probability of achieving target performance as a constraint instead of the maximizing objective.
    [From the introduction:] ... also concluded that in daily decision making, people tend to regard risk primarily as failure to achieve a predetermined goal [Lanzillotti, 1958; Simon, 1959; Mao, 1970; Payne et al., 1980; 1981]. Different variants of the probabilistic goal formulation, such as chance constrained programs, have been extensively studied in single-period optimization [Miller and Wagner, 1965; Prékopa, 1970]. However, little has been done in the context of sequential decision problems, including MDPs. The standard approaches in risk-averse MDPs include maximization of expected utility functions [Bertsekas, 1995] and optimization of a coherent risk measure [Riedel, 2004; Le Tallec, 2007]. Both approaches lead to formulations that cannot be solved in polynomial time, except for special cases including exponential utility functions [Chung and Sobel, 1987] and piecewise linear utility functions with a single ...
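    The abstract states the probabilistic-goal criterion but not the algorithm. Purely as an illustration of the criterion (and of why integer rewards and a finite horizon make such schemes pseudo-polynomial), the sketch below evaluates the maximum probability of reaching a target total reward by dynamic programming over states augmented with the accumulated reward; the toy MDP is hypothetical and this is not the paper's method.

```python
from functools import lru_cache

# Hypothetical one-state MDP with a "safe" and a "risky" action and integer rewards.
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {0: {"safe": [(1.0, 0, 1)],
         "risky": [(0.5, 0, 3), (0.5, 0, 0)]}}
HORIZON, TARGET = 4, 8

@lru_cache(maxsize=None)
def prob_goal(state, t, accumulated):
    """Maximum probability that the total reward reaches TARGET by the end of the horizon."""
    if t == HORIZON:
        return 1.0 if accumulated >= TARGET else 0.0
    return max(sum(p * prob_goal(s2, t + 1, accumulated + r) for p, s2, r in outcomes)
               for outcomes in P[state].values())

print(prob_goal(0, 0, 0))  # 0.375: the best policy starts risky and switches to safe once the goal is within reach
```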
  • Semi-Markov Decision Processes
    SEMI-MARKOV DECISION PROCESSES. M. Baykal-Gürsoy, Department of Industrial and Systems Engineering, Rutgers University, Piscataway, New Jersey. Email: [email protected].
    Abstract: Considered are infinite horizon semi-Markov decision processes (SMDPs) with finite state and action spaces. Total expected discounted reward and long-run average expected reward optimality criteria are reviewed. Solution methodology for each criterion is given; constraints and variance sensitivity are also discussed.
    1 Introduction. Semi-Markov decision processes (SMDPs) are used in modeling stochastic control problems arising in Markovian dynamic systems where the sojourn time in each state is a general continuous random variable. They are powerful, natural tools for the optimization of queues [20, 44, 41, 18, 42, 43, 21], production scheduling [35, 31, 2, 3], and reliability/maintenance [22]. For example, in a machine replacement problem with deteriorating performance over time, a decision maker, after observing the current state of the machine, decides whether to continue its usage, or initiate a maintenance (preventive or corrective) repair, or replace the machine. There is a reward or cost structure associated with the states and decisions, and an information pattern available to the decision maker. This decision depends on a performance measure over the planning horizon, which is either finite or infinite, such as total expected discounted or long-run average expected reward/cost, with or without external constraints, and variance-penalized average reward. SMDPs are based on semi-Markov processes (SMPs) [9] [Semi-Markov Processes], which include renewal processes [Definition and Examples of Renewal Processes] and continuous-time Markov chains (CTMCs) [Definition and Examples of CTMCs] as special cases.
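    The entry reviews the total expected discounted reward criterion; as a small, self-contained illustration only, the sketch below runs value iteration for a toy SMDP under the simplifying assumption of exponentially distributed sojourn times, which reduces discounting to a per-transition factor lam/(alpha+lam). The states, rates, and rewards are invented, and the general solution methodology covered in the chapter is broader than this special case.

```python
import numpy as np

# Value iteration for a toy discounted SMDP, assuming exponentially distributed
# sojourn times (rate LAM[s, a]) and a continuous reward rate RHO[s, a]. Under that
# assumption the expected discounted reward until the next decision epoch is
# rho/(alpha+lam) and the expected discount factor is lam/(alpha+lam).
ALPHA = 0.1                                   # continuous-time discount rate
LAM = np.array([[1.0, 2.0], [0.5, 1.5]])      # sojourn-time rates, indexed [state, action]
RHO = np.array([[1.0, 0.2], [0.0, 2.0]])      # reward rates, indexed [state, action]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],       # P[s, a, s']: next-state probabilities
              [[0.5, 0.5], [0.1, 0.9]]])

v = np.zeros(2)                               # one value per state
for _ in range(500):                          # iterate the Bellman operator to convergence
    q = RHO / (ALPHA + LAM) + LAM / (ALPHA + LAM) * (P @ v)
    v = q.max(axis=1)

print(v)  # optimal total expected discounted reward from each of the two states
```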
  • Graph-Based State Representation for Deep Reinforcement Learning (arXiv:2004.13965v3 [cs.LG], 16 Feb 2021)
    Graph-based State Representation for Deep Reinforcement Learning. Vikram Waradpande, Daniel Kudenko, and Megha Khosla. L3S Research Center, Leibniz University, Hannover. {waradpande,khosla,kudenko}@l3s.de.
    Abstract: Deep RL approaches build much of their success on the ability of the deep neural network to generate useful internal representations. Nevertheless, they suffer from a high sample-complexity, and starting with a good input representation can have a significant impact on performance. In this paper, we exploit the fact that the underlying Markov decision process (MDP) represents a graph, which enables us to incorporate the topological information for effective state representation learning. Motivated by the recent success of node representations for several graph analytical tasks, we specifically investigate the capability of node representation learning methods to effectively encode the topology of the underlying MDP in Deep RL. To this end we perform a comparative analysis of several models chosen from 4 different classes of representation learning algorithms for policy learning in grid-world navigation tasks, which are representative of a large class of RL problems. We find that all embedding methods outperform the commonly used matrix representation of grid-world environments in all of the studied cases. Moreover, graph convolution based methods are outperformed by simpler random walk based methods and graph linear autoencoders.
    1 Introduction. A good problem representation has been known to be crucial for the performance of AI algorithms. This is not different in the case of reinforcement learning (RL), where representation learning has been a focus of investigation.
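    The paper compares node2vec-style methods, graph convolutions, and graph (linear) autoencoders; as a minimal self-contained stand-in for the general idea of graph-based state features, the sketch below embeds the states of a small grid-world using eigenvectors of the graph Laplacian of its transition graph. The grid size and embedding dimension are arbitrary choices, and this spectral method is not one of the paper's models.

```python
import numpy as np

# Build the transition graph of a 4x4 grid-world and embed each state (cell) with
# Laplacian eigenvectors; these vectors could then serve as input features for a policy.
N = 4
def idx(r, c):
    return r * N + c

A = np.zeros((N * N, N * N))
for r in range(N):
    for c in range(N):
        for dr, dc in ((0, 1), (1, 0)):        # connect 4-neighbouring cells
            if r + dr < N and c + dc < N:
                A[idx(r, c), idx(r + dr, c + dc)] = 1.0
                A[idx(r + dr, c + dc), idx(r, c)] = 1.0

L = np.diag(A.sum(axis=1)) - A                 # combinatorial graph Laplacian
_, eigvecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
embeddings = eigvecs[:, 1:9]                   # drop the constant eigenvector, keep 8 dims

print(embeddings[idx(0, 0)])                   # feature vector for the top-left cell
```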
  • A Deep Reinforcement Learning Neural Network Folding Proteins
    DeepFoldit - A Deep Reinforcement Learning Neural Network Folding Proteins. Dimitra Panou (University of Athens, Department of Informatics and Telecommunications) and Martin Reczko (Biomedical Sciences Research Center “Alexander Fleming”).
    ABSTRACT: Despite considerable progress, ab initio protein structure prediction remains suboptimal. A crowdsourcing approach is the online puzzle video game Foldit [1], which has provided several useful results that matched or even outperformed algorithmically computed solutions [2]. Using Foldit, the WeFold [3] crowd had several successful participations in the Critical Assessment of Techniques for Protein Structure Prediction. Based on the recent Foldit standalone version [4], we trained a deep reinforcement neural network called DeepFoldit to improve the score assigned to an unfolded protein, using the Q-learning method [5] with experience replay. This paper is focused on model improvement through hyperparameter tuning. We examined various implementations by testing different model architectures and changing hyperparameter values to improve the accuracy of the model. The new model's hyperparameters also improved its ability to generalize. Initial results, from the latest implementation, show that given a set of small unfolded training proteins, DeepFoldit learns action sequences that improve the score both on the training set and on novel test proteins. Our approach combines the intuitive user interface of Foldit with the efficiency of deep reinforcement learning. KEYWORDS: ab initio protein structure prediction, Reinforcement Learning, Deep Learning, Convolutional Neural Networks, Q-learning.
    1. ALGORITHMIC BACKGROUND. Machine learning (ML) is the study of algorithms and statistical models used by computer systems to accomplish a given task without using explicit guidelines, relying on inferences derived from patterns. ML is a field of artificial intelligence.
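    The abstract names Q-learning with experience replay as the training method. The sketch below is a generic, library-free illustration of that scheme with a tabular Q-function and a hypothetical `env.reset()`/`env.step()` interface; it is not DeepFoldit's actual network or the Foldit environment.

```python
import random
from collections import deque, defaultdict

# Generic Q-learning with experience replay; hyperparameter values are illustrative.
ALPHA, GAMMA, EPSILON, BATCH = 0.1, 0.99, 0.1, 32
Q = defaultdict(float)                         # Q[(state, action)], tabular for simplicity
replay = deque(maxlen=10_000)                  # experience replay buffer

def act(state, actions):
    """Epsilon-greedy action selection."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def replay_update(actions):
    """Sample a batch of stored transitions and apply the Q-learning update to each."""
    if len(replay) < BATCH:
        return
    for s, a, r, s2, done in random.sample(list(replay), BATCH):
        target = r if done else r + GAMMA * max(Q[(s2, a2)] for a2 in actions)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def train_episode(env, actions):
    state = env.reset()
    done = False
    while not done:
        action = act(state, actions)
        next_state, reward, done = env.step(action)
        replay.append((state, action, reward, next_state, done))  # store, then replay a batch
        replay_update(actions)
        state = next_state
```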
  • An Introduction to Deep Reinforcement Learning
    An Introduction to Deep Reinforcement Learning. Ehsan Abbasnejad.
    Remember supervised learning: we have a set of sample observations with labels and learn to predict the labels given a new sample, e.g., learn the function that associates a picture of a dog or cat with the label "dog" or "cat". Supervised learning needs thousands of samples, and the samples have to be provided by experts. There are applications where we can't provide expert samples, where expert examples are not what we want to mimic, or where there is an interaction with the world; this is where deep reinforcement learning (e.g., AlphaGo) comes in.
    Scenario of reinforcement learning: the agent receives an observation of the environment's state, takes an action that changes the environment, and receives a reward; the agent learns to take actions maximizing expected reward. The actor/policy is essentially a function to be learned, Action = π(Observation): the observation is the function's input, the action is its output, and the reward is used to pick the best function.
    Reinforcement learning in a nutshell: RL is a general-purpose framework for decision-making. RL is for an agent with the capacity to act, each action influences the agent's future state, and success is measured by a scalar reward signal; the goal is to select actions to maximise future reward.
    Deep learning in a nutshell: DL is a general-purpose framework for representation learning. Given an objective, it learns the representation required to achieve that objective, directly from raw inputs.
  • A Brief Taster of Markov Decision Processes and Stochastic Games
    Algorithmic Game Theory and Applications. Lecture 15: a brief taster of Markov Decision Processes and Stochastic Games. Kousha Etessami.
    Warning: the subjects we will touch on today are so interesting and vast that one could easily spend an entire course on them alone. So, think of what we discuss today as only a brief "taster", and please do explore it further if it interests you. Here are two standard textbooks that you can look up if you are interested in learning more: M. Puterman, Markov Decision Processes, Wiley, 1994; and J. Filar and K. Vrieze, Competitive Markov Decision Processes, Springer, 1997 (this book is really about 2-player zero-sum stochastic games).
    Games against Nature: consider a game graph where some nodes belong to Player 1 but others are chance nodes of "Nature". [Figure: a small game graph with a start node, Player 1 nodes, Nature nodes with branch probabilities 1/3, 1/6, and 2/3, and payoffs 3, 1, and −2.] Question: what is Player 1's "optimal strategy" and "optimal expected payoff" in this game?
    A simple finite game: "make a big number". Your goal is to create as large a k-digit number as possible, using digits from D = {0, 1, 2, ..., 9}, which "nature" will give you, one by one. The game proceeds in k rounds. In each round, "nature" chooses d ∈ D "uniformly at random", i.e., each digit has probability 1/10. You then choose which "unfilled" position in the k-digit number should be "filled" with digit d.
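    As a side note, the "make a big number" game described above has a clean backward-induction solution. The sketch below is an illustration of mine (not part of the lecture notes): it computes the optimal expected value by dynamic programming over the set of still-unfilled place-value positions, with each digit observed before it is placed.

```python
from functools import lru_cache

K = 3  # number of digits / rounds in the game

@lru_cache(maxsize=None)
def value(open_positions):
    """Expected final number when these place-value positions (10**p) are still unfilled."""
    if not open_positions:
        return 0.0
    total = 0.0
    for d in range(10):                       # nature draws digit d uniformly at random
        total += max(                         # we place d in the best remaining position
            d * 10 ** p + value(tuple(q for q in open_positions if q != p))
            for p in open_positions
        )
    return total / 10

print(value(tuple(range(K))))  # optimal expected value of the 3-digit game
```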
  • Modelling Working Memory Using Deep Recurrent Reinforcement Learning
    Modelling Working Memory using Deep Recurrent Reinforcement Learning. Pravish Sainath (Centre de Recherche, Institut Universitaire de Gériatrie de Montréal; Mila - Quebec AI Institute; Université de Montréal), Pierre Bellec (Centre de Recherche, Institut Universitaire de Gériatrie de Montréal; Université de Montréal), Guillaume Lajoie (Mila - Quebec AI Institute; Université de Montréal).
    Abstract: In cognitive systems, the role of a working memory is crucial for visual reasoning and decision making. Tremendous progress has been made in understanding the mechanisms of the human/animal working memory, as well as in formulating different frameworks of artificial neural networks. In the case of humans, the visual working memory (VWM) task [1] is a standard one in which the subjects are presented with a sequence of images, each of which needs to be identified as to whether it was already seen or not. Our work is a study of multiple ways to learn a working memory model using recurrent neural networks that learn to remember input images across timesteps in supervised and reinforcement learning settings. The reinforcement learning setting is inspired by the popular view in neuroscience that the working memory in the prefrontal cortex is modulated by a dopaminergic mechanism. We consider the VWM task as an environment that rewards the agent when it remembers past information and penalizes it for forgetting. We quantitatively estimate the performance of these models on sequences of images from a standard image dataset (CIFAR-100 [2]) and their ability to remember and recall. Based on our analysis, we establish that a gated recurrent neural network model with long short-term memory units trained using reinforcement learning is powerful and more efficient in temporally consolidating the input spatial information.
  • Reinforcement Learning: a Survey
    Reinforcement Learning: A Survey. Leslie Pack Kaelbling (lpk@cs.brown.edu) and Michael L. Littman (mlittman@cs.brown.edu), Computer Science Department, Brown University, Providence, RI, USA; Andrew W. Moore (awm@cs.cmu.edu), Smith Hall, Carnegie Mellon University, Forbes Avenue, Pittsburgh, PA, USA. Journal of Artificial Intelligence Research.
    Abstract: This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement". The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.
    Introduction: Reinforcement learning dates back to the early days of cybernetics and work in statistics, psychology, neuroscience, and computer science. In the last five to ten years ...
  • Markov Decision Processes
    Markov Decision Processes. Robert Platt, Northeastern University. Some images and slides are used from: 1. CS188 UC Berkeley, 2. AIMA, 3. Chris Amato, 4. Stacy Marsella.
    Stochastic domains: so far, we have studied search. We can use search to solve simple planning problems, e.g. robot planning using A*, but only in deterministic domains; A* doesn't work so well in stochastic environments. We are going to introduce a new framework for encoding problems with stochastic dynamics: the Markov Decision Process (MDP).
    Sequential decision-making: making decisions under uncertainty. Rational decision making requires reasoning about one's uncertainty and objectives. The previous section focused on uncertainty; this section will discuss how to make rational decisions based on a probabilistic model and a utility function. Last class we focused on single-step decisions; now we will consider sequential decision problems.
    Review: expectimax. What if we don't know the outcome of actions? Actions can fail: when a robot moves, its wheels might slip, and opponents may be uncertain. [Figure: an expectimax tree with a MAX node choosing between actions a and b, chance nodes with outcome probabilities, and numeric leaf payoffs.] Expectimax search maximizes the average score: MAX nodes choose the action that maximizes the expected outcome, and chance nodes model an outcome distribution.
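    To make the expectimax review concrete, here is a minimal sketch of expectimax over a hand-built tree; the tree structure and payoff numbers are illustrative assumptions, not the slide's example.

```python
# Minimal expectimax: MAX nodes pick the best child, chance nodes average
# their children weighted by probability. Leaf values are made up.
def expectimax(node):
    kind, payload = node
    if kind == "leaf":
        return payload
    if kind == "max":
        return max(expectimax(child) for child in payload)
    if kind == "chance":
        return sum(p * expectimax(child) for p, child in payload)
    raise ValueError(kind)

tree = ("max", [
    ("chance", [(0.5, ("leaf", 10)), (0.5, ("leaf", 0))]),   # action a
    ("chance", [(0.3, ("leaf", 20)), (0.7, ("leaf", 2))]),   # action b
])
print(expectimax(tree))  # 7.4: action b (0.3*20 + 0.7*2) beats action a (expected 5.0)
```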
  • An Energy Efficient EdgeAI Autoencoder Accelerator for Reinforcement Learning
    Received 26 July 2020; revised 29 October 2020; accepted 28 November 2020. Date of current version 26 January 2021. Digital Object Identifier 10.1109/OJCAS.2020.3043737.
    An Energy Efficient EdgeAI Autoencoder Accelerator for Reinforcement Learning. Nitheesh Kumar Manjunath (Member, IEEE), Aidin Shiri, Morteza Hosseini (Graduate Student Member, IEEE), Bharat Prakash, and Tinoosh Mohsenin (Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore, MD 21250, USA); Nicholas R. Waytowich (U.S. Army Research Laboratory, Aberdeen, MD, USA). This article was recommended by Associate Editor Y. Li. Corresponding author: N. K. Manjunath (e-mail: [email protected]). This work was supported by the U.S. Army Research Laboratory through Cooperative Agreement under Grant W911NF-10-2-0022.
    ABSTRACT: In EdgeAI embedded devices that exploit reinforcement learning (RL), it is essential to reduce the number of actions taken by the agent in the real world and to minimize the compute-intensive policy learning process. Convolutional autoencoders (AEs) have demonstrated great improvement in speeding up policy learning when attached to the RL agent, by compressing the high-dimensional input data into a small latent representation for feeding the RL agent. Despite reducing the policy learning time, the AE adds significant computational and memory complexity to the model, which contributes to the increase in the total computation and the model size. In this article, we propose a model for speeding up the policy learning process of an RL agent with the use of AE neural networks, which engages binary and ternary precision to address the high complexity overhead without deteriorating the policy that the RL agent learns.
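    The abstract's central arithmetic idea is running the autoencoder at binary and ternary precision. As a rough, generic illustration of ternary weight quantization (a sketch of one common scheme, not necessarily the paper's exact method), consider:

```python
import numpy as np

def ternarize(weights, threshold=0.05):
    """Quantize weights to {-1, 0, +1} times a per-tensor scale (a generic ternary scheme)."""
    mask = np.abs(weights) > threshold          # keep only sufficiently large weights
    scale = np.abs(weights[mask]).mean() if mask.any() else 0.0
    return scale * np.sign(weights) * mask      # ternary values, rescaled

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))          # stand-in for an autoencoder weight matrix
print(ternarize(W))
```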
  • Sparse Cooperative Q-Learning
    Sparse Cooperative Q-learning. Jelle R. Kok ([email protected]) and Nikos Vlassis ([email protected]), Informatics Institute, Faculty of Science, University of Amsterdam, The Netherlands.
    Abstract: Learning in multiagent systems suffers from the fact that both the state and the action space scale exponentially with the number of agents. In this paper we are interested in using Q-learning to learn the coordinated actions of a group of cooperative agents, using a sparse representation of the joint state-action space of the agents. We first examine a compact representation in which the agents need to explicitly coordinate their actions only in a predefined set of states. Next, we use a coordination-graph approach in which we represent the Q-values by value rules that specify the coordination dependencies of the agents at particular states.
    [From the introduction:] ... in uncertain environments. In principle, it is possible to treat a multiagent system as a 'big' single agent and learn the optimal joint policy using standard single-agent reinforcement learning techniques. However, both the state and action space scale exponentially with the number of agents, rendering this approach infeasible for most problems. Alternatively, we can let each agent learn its policy independently of the other agents, but then the transition model depends on the policy of the other learning agents, which may result in oscillatory behavior. On the other hand, in many problems the agents only need to coordinate their actions in a few states (e.g., two cleaning robots that want to clean the same room), while in the rest of the states the agents can act independently.
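    To illustrate the kind of decomposition the abstract describes, here is a toy sketch in which the joint Q-value of two agents is a sum of individual terms plus a state-dependent coordination term; the states, actions, and numbers are made up, and the actual method learns such values with Q-learning over value rules rather than hard-coding them.

```python
from itertools import product

# Two cooperating agents; the joint Q-value decomposes into individual terms plus a
# coordination term that is only active in states where coordination matters.
ACTIONS = ("left", "right")

def q1(state, a1):
    return {"left": 0.2, "right": 0.6}[a1]          # agent 1's individual preference

def q2(state, a2):
    return {"left": 0.1, "right": 0.4}[a2]          # agent 2's individual preference

def q_coord(state, a1, a2):
    return 1.0 if state == "same_room" and a1 != a2 else 0.0

def best_joint_action(state):
    return max(product(ACTIONS, ACTIONS),
               key=lambda ja: q1(state, ja[0]) + q2(state, ja[1]) + q_coord(state, *ja))

print(best_joint_action("same_room"))   # ('right', 'left'): the agents split up to coordinate
print(best_joint_action("elsewhere"))   # ('right', 'right'): individual preferences decide
```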