Simplified Online Q-Learning for EV3 Robot

Ke Xu∗, Fengge Wu†, and Junsuo Zhao‡
Science and Technology on Integrated Information System Laboratory
Institute of Software, Chinese Academy of Sciences
∗Email: [email protected] †Email: [email protected] ‡Email: [email protected]

Abstract—Q-learning is a model-free reinforcement learning algorithm that is effective in robot navigation applications. Unfortunately, the EV3 robot's file writing speed is sometimes too slow to run the Q-learning algorithm online. In this paper, an approach is proposed that simplifies the discrete Q-value table into a new version that stores only one optimum action and its Q-value per state instead of storing every action's Q-value in each state. Exploration and contrast experiments show that our algorithm learns much faster than the original Q-learning without losing the ability to find a better policy in a navigation task.

I. INTRODUCTION

Applying reinforcement learning (RL) to a robotic control system with limited computational resources, limited memory, limited I/O, etc., is a challenging task. Both conventional RL algorithms such as Q-learning and SARSA and more advanced methods such as gradient-descent temporal-difference algorithms have O(n) computation and memory complexity; nevertheless, they sometimes cost too much for a robotic control system to learn online tasks [1]. So there is a need to simplify current RL approaches in order to improve the performance of current robotic control systems.

Over recent years, the increasing availability of low-cost programmable robotics systems has led to a growth in implementations of reinforcement learning control experiments. The Mindstorms robots manufactured by Lego are one such robotics system: they have relatively low cost, are programmable, and can accomplish a variety of tasks with their flexible physical structure [2]. The approaches that implement RL experiments on Lego Mindstorms robots may be classified into three categories. The first approach is to use computer simulation before the real robot experiment [3]. Simulation is easier to set up, less expensive, fast, more convenient to use, and allows the user to perform experiments without the risk of damaging the robot. However, many unpredictable robot behaviours are not perfectly modeled by robotics simulation software. The second approach is to use a computer as the robot's brain, with a wired or wireless connection to the robot, instead of the robot's embedded CPU and memory [4]. A computer has much larger memory and a much faster CPU, which makes the implementation of a much stronger RL algorithm possible. The main drawback of this approach is that the robot is not totally independent: it will not work if it loses the connection to the computer. The third approach is to reduce the computation and memory requirements of the program. In some wandering, line-following and obstacle-avoidance experiments, there are fewer than 20 states and 10 actions in the program, which is small enough to run on the Lego's embedded CPU [5], [6]. The main drawback of the Lego's embedded computation resources is its file writing speed. It takes the Lego about 1 minute to write a 20 KB file to its memory. When it comes to a large state-action space, the reinforcement learning algorithm's running time becomes intolerable. From the above discussion, it appears that at present no algorithm that can deal with a large state-action space on the Lego's embedded CPU has been proposed.

In this paper, the primary goal is to simplify the RL algorithm under thousands of states and make it possible to learn online tasks with the Lego's embedded CPU. As a result, a simplified online Q-learning (SOQ) algorithm that costs less memory is proposed to make the learning process faster than the original Q-learning algorithm. Finally, a navigation experiment tests the performance of the SOQ algorithm on a Lego Mindstorms EV3 robot system. Empirical results indicate that the SOQ algorithm can make the robot perform better in the navigation experiment and reduce learning time at the same time.

The remainder of this paper is organized as follows: Section II explains the proposed simplified online Q-learning algorithm. Section III describes the navigation experiment setup. Section IV describes the experimental results. Section V concludes the paper.

II. SIMPLIFIED ONLINE Q-LEARNING ALGORITHM

A. Q-Learning

In reinforcement learning, the agent learns a behavior based on its expected reward. In the Q-learning implementation of reinforcement learning, expected rewards are stored in a state-action pair array; the elements of this array are called Q-values. Q-learning is a temporal-difference method that learns a state-action value instead of a state value. Q-values are directly related to the value function as follows:

    V^*(s) = \max_a Q(s, a)    (1)

Q-functions have a very important property: a temporal-difference agent that learns a Q-function does not need a model of the form P_a(s, s'), either for learning or for action selection. That is to say, it requires no model of state transitions; all it needs are the Q-values. The update equation for temporal-difference Q-learning is:

    Q(s, a) = Q(s, a) + \alpha \big( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)    (2)

where \alpha is a learning rate parameter. Compared to equation (1), it is obvious that Q-learning gets rid of P_a(s, s'), which means that computing Q-values does not need a state transition model.

Algorithm 1 Q-Learning-Agent returns an action [7]
Input: the current state s' and reward r'
1: Q, a table of action values indexed by state and action
2: s, a, r, the previous state, action, and reward, initially null
3: if s is not null then
4:     Q(s, a) ← Q(s, a) + α(r + γ max_{a'} Q(s', a') − Q(s, a))
5: end if
6: if s' is terminal then
7:     s, a, r ← null
8: else
9:     s, a, r ← s', argmax_{a'} f(Q(s', a')), r'
10: end if
11: return a
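To make equation (2) and Algorithm 1 concrete, the following is a minimal sketch of a tabular Q-learning update in Java, the language used for the experiment in Section III. The class and method names here are illustrative assumptions, not taken from the paper's source code.

    /** Minimal tabular Q-learning sketch; names are illustrative only. */
    public class TabularQLearning {
        private final double[][] q;   // one Q-value for every (state, action) pair
        private final double alpha;   // learning rate
        private final double gamma;   // discount factor

        public TabularQLearning(int numStates, int numActions, double alpha, double gamma) {
            this.q = new double[numStates][numActions];
            this.alpha = alpha;
            this.gamma = gamma;
        }

        /** Applies the temporal-difference update of equation (2). */
        public void update(int s, int a, double reward, int sNext) {
            double best = q[sNext][0];
            for (int aNext = 1; aNext < q[sNext].length; aNext++) {
                best = Math.max(best, q[sNext][aNext]);
            }
            q[s][a] += alpha * (reward + gamma * best - q[s][a]);
        }

        /** Greedy action used when exploiting. */
        public int bestAction(int s) {
            int argmax = 0;
            for (int a = 1; a < q[s].length; a++) {
                if (q[s][a] > q[s][argmax]) {
                    argmax = a;
                }
            }
            return argmax;
        }
    }

Note the dense |S| × |A| array: it is exactly this structure that the simplified table in Section II-B shrinks.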

In the discrete state-action value table implementation of Q-learning, the number of Q-values is |S| × |A|, which means that in each state every action's Q-value has to be calculated. For example, 1000 states with 10 possible actions each lead to 10,000 Q-values to keep. To obtain the optimum policy, the agent should follow the action whose Q-value is the maximum in each state. This approach does not construct explicit policies; the solution is the maximum value over state-action pairs, so it is more like a value iteration than a policy iteration algorithm for reinforcement learning. Another Q-learning problem is the so-called exploration-exploitation trade-off. Since going through every action in each state to find its Q-value is impossible in practice, the agent has to exploit the predicted rewards stored in the Q-table, with some random probability, before it has finished exploring the whole state-action reward space. That is to say, at each time step, if the agent randomly chooses exploitation, it uses the best available action; if not, an action that has not been explored is selected.

Fig. 1. State-action Q-values in original Q-learning. The Q-value table keeps every state-action pair's Q-value.

B. Simplified Online Q-Learning

The main drawback of the original discrete state-action value table implementation of Q-learning is the file writing speed. Comparing figure 2 with figure 1, we can see that after every iteration of exploration the robot has a new group of Q-values that needs to be written into a memory file, so that the robot can memorize former results. Unfortunately, the robot needs about 8 minutes to refresh all the Q-values in the memory file. This is intolerable for an online learning task. One possible way to improve this is off-line learning: connect the robot to a computer and let the computer do all the computation. But sometimes the learning needs to be finished online; for example, when the robot explores a harsh environment such as a desert or outer space, it is difficult to do off-line learning.

Fig. 2. State-action Q-values in simplified online Q-learning. The Q-value table keeps every state's best action and its Q-value.

A simplified online Q-learning is therefore proposed to reduce the memory cost and accelerate learning. Instead of keeping every action's Q-value in each state, as in Algorithm 1, Algorithm 2 keeps only the best action's Q-value in each state. When the robot finds a better action, whose Q-value is bigger than the previous best action's, it replaces the previous best action with the new one. In this method only one action needs to be kept per state, so the Q-value table is about 8 times smaller than the original Q-value table in the navigation experiment. The SOQ algorithm now needs about 1 minute to refresh the Q-value table in each iteration of exploration, which is about 8 times faster than the original Q-learning approach.

Algorithm 2 Simplified-Online-Q-Learning-Agent returns an action
Input: the current state s' and reward r'
1: Q, a table of action values indexed by state (one best action and its Q-value per state)
2: s, a, r, the previous state, action, and reward, initially null
3: P, the probability used to trade off exploration and exploitation
4: if s is not null then
5:     randomly choose a float number p from 0 to 1
6:     if p < P then
7:         exploit the current optimum action a
8:     else
9:         explore another action a
10:    end if
11:    Q(s, a) ← r + Q(s', a')
12:    if Q(s, a) > max_a Q(s, a) then
13:        max_a Q(s, a) ← Q(s, a)
14:    end if
15: end if
16: if s' is terminal then
17:     s, a, r ← null
18: else
19:     s, a, r ← s', argmax_{a'} f(Q(s', a')), r'
20: end if
21: return a
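As a sketch of how Algorithm 2's per-state storage could look in code, the fragment below keeps only one best action and its Q-value per state and replaces the entry when a better action is found. It follows the structure of Algorithm 2, but the class, field and method names are illustrative assumptions rather than the paper's actual source code.

    /* Sketch of the SOQ per-state table: one best action and one Q-value
     * per state, as in Algorithm 2. Names are illustrative assumptions. */
    public class SimplifiedOnlineQ {
        private final int[] bestAction;   // best action found so far for each state
        private final double[] bestQ;     // Q-value of that best action
        private final java.util.Random rng = new java.util.Random();

        public SimplifiedOnlineQ(int numStates, int initialAction, double[] initialQ) {
            bestAction = new int[numStates];
            bestQ = new double[numStates];
            for (int s = 0; s < numStates; s++) {
                bestAction[s] = initialAction;   // action given by the initial policy
                bestQ[s] = initialQ[s];          // initial Q-value (see Table I)
            }
        }

        /** Exploit the stored best action with probability P, otherwise explore. */
        public int chooseAction(int state, double P, int numActions) {
            if (rng.nextDouble() < P) {
                return bestAction[state];
            }
            return rng.nextInt(numActions);      // explore another action
        }

        /** Lines 11-14 of Algorithm 2: keep the action only if it beats the stored best. */
        public void update(int state, int action, double reward, int nextState) {
            double qNew = reward + bestQ[nextState];   // Q(s,a) <- r + Q(s',a')
            if (qNew > bestQ[state]) {
                bestQ[state] = qNew;
                bestAction[state] = action;
            }
        }
    }

Storing two arrays of length |S| instead of an |S| × |A| matrix is what shrinks the file that has to be rewritten after each iteration of exploration.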

III. NAVIGATION EXPERIMENT

A. Lego Mindstorms EV3 Robot

The experiment implemented the SOQ algorithm in the Java language using the LeJOS implementation for the Lego Mindstorms EV3 robot [8]. Figure 3 shows that the robot has two pedrails and an infrared sensor. Each pedrail can move back and forth freely, so the robot can move forward, move backward and turn around. The infrared sensor can detect objects and track the remote infrared beacon. In the experiment, the robot's goal is to find the remote infrared beacon as fast as possible.

The detection range of the infrared sensor in beacon mode is said to be 200 cm in the EV3 guide book. In practical measurement, the valid measuring distance is about 80 cm, with the display number ranging from 0 to 100; between 80 cm and 200 cm the distance display number remains at its maximum. In fact, the distance display number and the real distance are not strictly linearly correlated, and the sensitivity becomes worse when the distance approaches 80 cm, which is a common phenomenon for a practical sensor.

To decide the remote beacon's position on the two-dimensional ground, another dimension of data besides the distance is also needed. The beacon mode provides the angle from the infrared sensor to the remote beacon, with a range between −25° and 25°. With the distance and angle data, the remote beacon can be located relative to the sensor and reached with the pedrails.

Fig. 3. EV3 robot with two pedrails and an infrared sensor.

B. Experimental Setup

The robot started about 80 cm away from the remote infrared beacon, within the −25° to 25° range. The robot had an initial policy to make sure it could find the beacon in any case; it would then explore new actions and find a faster policy. Finding the beacon was defined as the distance display number dropping below 10, which was also the final state of the experiment. Between the initial state and the final state, the algorithm discretized the distance into 91 values from 10 cm to 100 cm and the angle into 51 values from −25° to 25°, so the number of states was 91 × 51 = 4641. In each state, each pedrail had 4 action levels: action level 0 meant stop and action level 3 had the fastest moving speed. So for each state there were 4 × 4 = 16 actions, and the total number of Q-values was 4641 × 16 = 74256.
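The sketch below shows one way the 91 × 51 state grid and the 4 × 4 action set of this setup could be indexed in Java. The helper names and the exact clamping are assumptions made for illustration, not taken from the experiment's code.

    /* Sketch of the state/action indexing for 91 distance bins (10..100)
     * and 51 angle bins (-25..25). Helper names are assumptions. */
    public final class NavigationStateSpace {
        public static final int NUM_DISTANCES = 91;   // display numbers 10..100
        public static final int NUM_ANGLES = 51;      // degrees -25..25
        public static final int NUM_STATES = NUM_DISTANCES * NUM_ANGLES;   // 4641
        public static final int NUM_ACTIONS = 4 * 4;  // 4 speed levels per pedrail

        /** Maps a (distance display number, angle) reading to a state index. */
        public static int stateIndex(int distance, int angle) {
            int d = Math.min(Math.max(distance, 10), 100) - 10;   // 0..90
            int a = Math.min(Math.max(angle, -25), 25) + 25;      // 0..50
            return d * NUM_ANGLES + a;
        }

        /** Encodes a pair of pedrail speed levels (0..3 each) as one action index. */
        public static int actionIndex(int leftLevel, int rightLevel) {
            return leftLevel * 4 + rightLevel;
        }

        private NavigationStateSpace() { }
    }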

The initial policy depended on the angle. If the absolute angle was less than 5°, the robot would move forward with both pedrails at speed level 2. If the absolute angle was between 5° and 15°, the robot would turn toward the middle with one pedrail at speed level 2 and the other at level 1. And if the absolute angle was bigger than 15°, the speed levels would be 2 and 0. This policy could make the robot find the remote infrared beacon at a medium speed; a bad learning algorithm could make the robot slower.

In the control program, the state and the output actions were updated every 0.1 second using a timing interrupt. After every action, the immediate reward was R_a(s, s') = −0.01, which means the robot was punished by −0.01 every 0.1 second. If the robot found the beacon, it received a reward of 1. With the initial policy, the robot could finish the experiment in about 7 to 8 seconds. Based on this fact, all initial Q-values could be decided from the displayed distance numbers.

TABLE I. INITIAL Q-VALUES BY DISTANCE

    distance   11     12     ...    99     100
    Q-value    0.99   0.98   ...    0.11   0.10

In Table I, the initial definition of the Q-value means that the robot would move toward the beacon by a distance of 1 every 0.1 second. It can be seen as prior knowledge given to the robot to make the algorithm converge faster; a random initial Q-value setting would also work. This initialization was a little slower than the initial policy, so whenever the robot went through a state, its Q-value would be changed to be slightly bigger than the initial value. This makes it possible to find out which states the robot has reached.¹

¹ Source code of the SOQ algorithm can be found at: https://github.com/xkcoke/Simplified-Online-Q-learning.
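As an illustration, one formula that reproduces the initialization in Table I (distance 11 → 0.99, ..., distance 100 → 0.10) is Q0(d) = (110 − d)/100. The snippet below is a hedged sketch of such an initialization; the class and method names are assumptions.

    /* Sketch of the Table I initialization: Q0(d) = (110 - d) / 100 for
     * distance display numbers 11..100. Names are assumptions. */
    public final class InitialQValues {
        /** Returns the initial Q-value for a given distance display number. */
        public static double initialQ(int distanceDisplay) {
            return (110 - distanceDisplay) / 100.0;   // 11 -> 0.99, 100 -> 0.10
        }

        /** Builds the table of initial Q-values, indexed by display number. */
        public static double[] buildTable() {
            double[] q = new double[101];
            for (int d = 11; d <= 100; d++) {         // entries 11..100 as in Table I
                q[d] = initialQ(d);
            }
            return q;
        }

        private InitialQValues() { }
    }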

IV. EXPERIMENTAL RESULTS

In the experiment, the exploration-exploitation trade-off probability was set to 0.94, meaning the robot explores a new action with probability 0.06; this makes sure the robot is not so aggressive that it loses its direction. After 50 iterations of the exploration experiment, the robot had updated 998 Q-values, which was 21.7% of the total Q-values. Over 10 runs of the contrast experiment, the reinforced policy was about 4% faster than the original policy.
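The fragment below sketches how the pieces described above might be wired together in the 0.1-second control loop, reusing the SimplifiedOnlineQ and NavigationStateSpace sketches from the earlier sections. The Beacon interface and the constants are illustrative assumptions and are not LeJOS API calls.

    /* Hedged sketch of the 0.1 s control loop with P = 0.94 and the
     * -0.01 step reward. Beacon and constants are assumptions. */
    public class NavigationLoop {
        interface Beacon {                    // assumed wrapper around the IR sensor
            int distance();                   // display number 0..100
            int angle();                      // degrees -25..25
        }

        private static final double P_EXPLOIT = 0.94;    // exploit with probability 0.94
        private static final double STEP_REWARD = -0.01; // punishment every 0.1 second
        private static final double GOAL_REWARD = 1.0;   // reward for finding the beacon

        public void run(SimplifiedOnlineQ table, Beacon beacon) throws InterruptedException {
            int prevState = -1, prevAction = -1;
            while (true) {
                int state = NavigationStateSpace.stateIndex(beacon.distance(), beacon.angle());
                boolean done = beacon.distance() < 10;    // terminal condition
                if (prevState >= 0) {
                    double reward = done ? GOAL_REWARD : STEP_REWARD;
                    table.update(prevState, prevAction, reward, state);
                }
                if (done) {
                    break;
                }
                int action = table.chooseAction(state, P_EXPLOIT, NavigationStateSpace.NUM_ACTIONS);
                // applying the action would set the two pedrail speed levels here
                prevState = state;
                prevAction = action;
                Thread.sleep(100);                        // 0.1 second control period
            }
        }
    }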

Fig. 4. Q-value table displayed on the EV3 LCD.

In figure 4, a Q-value block includes two kinds of data separated by a comma: the first is the best action in that state and the second is the Q-value of that action. For example, in the red block, "23" means the left pedrail's speed was level 2 and the right pedrail's speed was level 3. The yellow blocks hold initial actions and Q-values; they mark states that have not been reached by the robot. The pink block has the initial action but a new Q-value, meaning the robot has reached that state with an initial action. The red block has a new action and a new Q-value, meaning the robot has reached that state with a new action. Compared to the original Q-learning approach, the SOQ algorithm stores only one Q-value and one best action in each state.
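Since each state's entry is just a best action and its Q-value, displayed as an "action,Q-value" pair in figure 4, the compact table can be written to storage as one short record per state. The sketch below shows one plausible serialization in Java; the file path and record format are assumptions, since the paper does not specify them.

    /* Sketch of writing the compact SOQ table to a file, one "action,Qvalue"
     * record per state. Path and formatting are assumptions. */
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public final class QTableFile {
        /** Writes one line per state: best action and its Q-value, comma separated. */
        public static void save(String path, int[] bestAction, double[] bestQ) throws IOException {
            try (PrintWriter out = new PrintWriter(new FileWriter(path))) {
                for (int s = 0; s < bestAction.length; s++) {
                    out.print(bestAction[s]);
                    out.print(',');
                    out.println(bestQ[s]);
                }
            }
        }

        private QTableFile() { }
    }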

Fig. 5. Exploration experiment result: (a) new rewards, (b) new actions. The figure shows the increase in new rewards and new actions found by the SOQ algorithm during the experiment.

Figure 5 shows the exploration iterations of the robot. The new rewards in subfigure 5(a) correspond to the second data field in figure 4; they count the states that the robot has explored and whose Q-values have been modified. The new actions in subfigure 5(b) correspond to the first data field in figure 4; they count the new, better actions found by the robot. The remote infrared beacon was placed to the left in front of the robot, about 80 cm and 20° away. Although the initial conditions were almost the same in each exploration, the robot was able to find states it had never reached before, and it was also able to find better actions in most iterations.

To test the result of learning, the reinforced policy and the initial policy were compared on exactly the same navigation task. Figure 6 shows the result of 10 runs of the contrast experiment.

Fig. 6. Contrast experiment. The blue line shows the initial policy's performance and the orange line shows the reinforced policy's performance.

Although the improvement is limited, the new policy is clearly better than the original one. It still has potential, because the RL process is far from convergence after 50 iterations of the exploration experiment. In the contrast experiment, the robot always trembled when running the reinforced policy. The reason is that the new actions are discrete, like many gene mutations scattered through the Q-value table; as the exploration experiment approaches convergence, the robot's behavior will become more continuous.

V. CONCLUSION

The simplified discrete state-action value table implementation of Q-learning, which keeps one optimum action and its Q-value per state, reduced the file writing cost of online Q-learning running on the Lego Mindstorms EV3 robot. Unlike related works, which had no more than 20 states and 10 actions in their algorithms, the SOQ algorithm can deal much faster with thousands of state-action values.

The contrast experiment showed that the SOQ approach can improve the navigation performance of the robot. When a robotic task requires a large space of states and actions, memorizing the best actions instead of memorizing every action can be one possible way to accelerate a learning algorithm.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (No. 61202218).

REFERENCES

[1] Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 993–1000, 2009.
[2] Lego Mindstorms. http://mindstorms.lego.com/. Retrieved July 18, 2015.
[3] Y. Wicaksono, R. Khoswanto, and S. Kuswadi. Behaviors coordination and learning on autonomous navigation of physical robot. Telkomnika, 9(3):473–482, 2011.
[4] Víctor Ricardo Cruz-Álvarez, Enrique Hidalgo-Peña, and Héctor-Gabriel Acosta-Mesa. A line follower robot implementation using Lego Mindstorms kit and Q-learning. Acta Universitaria, 22(0), 2012.
[5] Gabriel J. Ferrer. Encoding robotic sensor states for Q-learning using the self-organizing map. Journal of Computing Sciences in Colleges, 25:133–139, 2010.
[6] Ángel Martínez-Tenor, Juan-Antonio Fernández-Madrigal, and Ana Cruz-Martín. Lego Mindstorms NXT and Q-learning: a teaching approach for robotics in engineering. 2014.
[7] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2009.
[8] LeJOS, Java for Lego Mindstorms. http://lejos.sourceforge.net/. Retrieved July 18, 2015.