Simplified Online Q-Learning for EV3 Robot

Ke Xu∗, Fengge Wu†, and Junsuo Zhao‡
Science and Technology on Integrated Information System Laboratory
Institute of Software, Chinese Academy of Sciences
∗Email: [email protected] †Email: [email protected] ‡Email: [email protected]

Abstract—Q-learning is a model-free reinforcement learning algorithm that is effective in robot navigation applications. Unfortunately, the EV3 robot's file writing speed is sometimes too slow to run the Q-learning algorithm online. In this paper, an approach is proposed that simplifies the discrete Q-value table into a new version that stores only one optimum action and its Q-value per state instead of storing every action's Q-value in each state. Exploration and contrast experiments show that our algorithm learns much faster than the original Q-learning without losing the ability to find a better policy in a navigation task.

I. INTRODUCTION

Applying reinforcement learning (RL) to a robotic control system with limited computational resources, limited memory, limited I/O, etc., is a challenging task. Both conventional RL algorithms such as Q-learning and SARSA and more advanced methods such as gradient-descent temporal-difference algorithms have O(n) computation and memory complexity; nevertheless, they sometimes cost too much for a robotic control system to learn online tasks [1]. So there is a need to simplify current RL approaches in order to improve the performance of current robotic control systems.

Over recent years, the increasing availability of low-cost programmable robotics systems has led to a growth in implementations of reinforcement learning control experiments. The Mindstorms robots manufactured by Lego are one such robotics system: they have relatively low cost, are programmable, and can accomplish a variety of tasks with their flexible physical structure [2]. The approaches that implement RL experiments on Lego Mindstorms robots may be classified into three categories. The first approach is to use computer simulation before the real robot experiment [3]. Simulation is easier to set up, less expensive, fast, more convenient to use, and allows the user to perform experiments without the risk of damaging the robot. However, many unpredictable robot behaviours are not perfectly modeled by robotics simulation software. The second approach is to use a computer as the robot's brain, with a wired or wireless connection to the robot, instead of the robot's embedded CPU and memory [4]. A computer has much larger memory and a much faster CPU, which makes the implementation of a much stronger RL algorithm possible. The main drawback of this approach is that the robot is not totally independent: it will not work if it loses the connection to the computer. The third approach is to reduce the computation and memory requirements of the program. In some wandering, line-following and obstacle-avoidance experiments, there are fewer than 20 states and 10 actions in the program, which is small enough to run on the Lego's embedded CPU [5], [6]. The main drawback of the Lego's embedded computation resources is its file writing speed. It takes the Lego about 1 minute to write a 20 KB file to its memory. When it comes to a large state-action space, the reinforcement learning algorithm's running time becomes intolerable. From the above discussion, it appears that at present no algorithm that can deal with a large state-action space on the Lego's embedded CPU has been proposed.

In this paper, the primary goal is to simplify the RL algorithm under thousands of states and make it possible to learn online tasks with the Lego's embedded CPU. As a result, a simplified online Q-learning (SOQ) algorithm that costs less memory is proposed to make the learning process faster than the original Q-learning algorithm. Finally, a navigation experiment tests the performance of the SOQ algorithm on a Lego Mindstorms EV3 robot system. Empirical results indicate that the SOQ algorithm can make the robot perform better in the navigation experiment and reduce learning time at the same time.

The remainder of this paper is organized as follows: Section II explains the proposed simplified online Q-learning algorithm. Section III describes the navigation experiment setup. Section IV describes the experimental results. Section V concludes the paper.

II. SIMPLIFIED ONLINE Q-LEARNING ALGORITHM

A. Q-Learning

In reinforcement learning, the agent learns a behavior based on its expected reward. In the Q-learning implementation of reinforcement learning, expected rewards are stored in a state-action pair array; the elements of this array are called Q-values. Q-learning is a temporal-difference method that learns a state-action value instead of a state value. Q-values are directly related to the value function as follows:

    V^*(s) = \max_a Q(s, a)    (1)

Q-functions have a very important property: a temporal-difference agent that learns a Q-function does not need a model of the form P_a(s, s'), either for learning or for action selection. That is to say, it requires no model of state transitions; all it needs are the Q-values. The update equation for temporal-difference Q-learning is:

    Q(s, a) = Q(s, a) + \alpha \big( R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)    (2)

where \alpha is a learning rate parameter. Compared to equation (1), it is obvious that Q-learning gets rid of P_a(s, s'), which means that computing Q-values does not need a state transition model.

Algorithm 1 Q-Learning-Agent returns an action [7]
Input: the current state s' and reward r'
1: Q, a table of action values indexed by state and action
2: s, a, r, the previous state, action, and reward, initially null
3: if s is not null then
4:     Q(s, a) ← Q(s, a) + α(r + γ max_{a'} Q(s', a') − Q(s, a))
5: end if
6: if s' is terminal then
7:     s, a, r ← null
8: else
9:     s, a, r ← s', argmax_{a'} f(Q(s', a')), r'
10: end if
11: return a
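To make equation (2) and Algorithm 1 concrete, the following is a minimal sketch of a tabular Q-learning update in Java, the language used for the experiment in Section III. The class and method names here are illustrative assumptions, not taken from the paper's source code.

    /** Minimal tabular Q-learning sketch; names are illustrative only. */
    public class TabularQLearning {
        private final double[][] q;   // one Q-value for every (state, action) pair
        private final double alpha;   // learning rate
        private final double gamma;   // discount factor

        public TabularQLearning(int numStates, int numActions, double alpha, double gamma) {
            this.q = new double[numStates][numActions];
            this.alpha = alpha;
            this.gamma = gamma;
        }

        /** Applies the temporal-difference update of equation (2). */
        public void update(int s, int a, double reward, int sNext) {
            double best = q[sNext][0];
            for (int aNext = 1; aNext < q[sNext].length; aNext++) {
                best = Math.max(best, q[sNext][aNext]);
            }
            q[s][a] += alpha * (reward + gamma * best - q[s][a]);
        }

        /** Greedy action used when exploiting. */
        public int bestAction(int s) {
            int argmax = 0;
            for (int a = 1; a < q[s].length; a++) {
                if (q[s][a] > q[s][argmax]) {
                    argmax = a;
                }
            }
            return argmax;
        }
    }

Note the dense |S| × |A| array: it is exactly this structure that the simplified table in Section II-B shrinks.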

In the discrete state-action value table implementation of Q-learning, the number of Q-values is |S| × |A|, which means that in each state every action's Q-value has to be calculated. For example, 1000 states with 10 possible actions each lead to 10,000 Q-values to keep. To obtain the optimum policy, the agent should follow the action whose Q-value is the maximum in each state. This approach does not construct explicit policies; the solution is the maximum value over state-action pairs, so it is more like a value iteration than a policy iteration algorithm for reinforcement learning. Another Q-learning problem is the so-called exploration-exploitation trade-off. Since going through every action in each state to find its Q-value is impossible in practice, the agent has to exploit the predicted rewards stored in the Q-table, with some random probability, before it has finished exploring the whole state-action reward space. That is to say, at each time step, if the agent randomly chooses exploitation, it uses the best available action; if not, an action that has not been explored is selected.

Fig. 1. State-action Q-values in original Q-learning. The Q-value table keeps every state-action pair's Q-value.

B. Simplified Online Q-Learning

The main drawback of the original discrete state-action value table implementation of Q-learning is the file writing speed. Comparing figure 2 with figure 1, we can see that after every iteration of exploration the robot has a new group of Q-values that needs to be written into a memory file, so that the robot can memorize former results. Unfortunately, the robot needs about 8 minutes to refresh all the Q-values in the memory file. This is intolerable for an online learning task. One possible way to improve this is off-line learning: connect the robot to a computer and let the computer do all the computation. But sometimes the learning needs to be finished online; for example, when the robot explores a harsh environment such as a desert or outer space, it is difficult to do off-line learning.

Fig. 2. State-action Q-values in simplified online Q-learning. The Q-value table keeps every state's best action and its Q-value.

A simplified online Q-learning is therefore proposed to reduce the memory cost and accelerate learning. Instead of keeping every action's Q-value in each state, as in Algorithm 1, Algorithm 2 keeps only the best action's Q-value in each state. When the robot finds a better action, whose Q-value is bigger than the previous best action's, it replaces the previous best action with the new one. In this method only one action needs to be kept per state, so the Q-value table is about 8 times smaller than the original Q-value table in the navigation experiment. The SOQ algorithm now needs about 1 minute to refresh the Q-value table in each iteration of exploration, which is about 8 times faster than the original Q-learning approach.

Algorithm 2 Simplified-Online-Q-Learning-Agent returns an action
Input: the current state s' and reward r'
1: Q, a table of action values indexed by state (one best action and its Q-value per state)
2: s, a, r, the previous state, action, and reward, initially null
3: P, the probability used to trade off exploration and exploitation
4: if s is not null then
5:     randomly choose a float number p from 0 to 1
6:     if p < P then
7:         exploit the current optimum action a
8:     else
9:         explore another action a
10:    end if
11:    Q(s, a) ← r + Q(s', a')
12:    if Q(s, a) > max_a Q(s, a) then
13:        max_a Q(s, a) ← Q(s, a)
14:    end if
15: end if
16: if s' is terminal then
17:     s, a, r ← null
18: else
19:     s, a, r ← s', argmax_{a'} f(Q(s', a')), r'
20: end if
21: return a
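As a sketch of how Algorithm 2's per-state storage could look in code, the fragment below keeps only one best action and its Q-value per state and replaces the entry when a better action is found. It follows the structure of Algorithm 2, but the class, field and method names are illustrative assumptions rather than the paper's actual source code.

    /* Sketch of the SOQ per-state table: one best action and one Q-value
     * per state, as in Algorithm 2. Names are illustrative assumptions. */
    public class SimplifiedOnlineQ {
        private final int[] bestAction;   // best action found so far for each state
        private final double[] bestQ;     // Q-value of that best action
        private final java.util.Random rng = new java.util.Random();

        public SimplifiedOnlineQ(int numStates, int initialAction, double[] initialQ) {
            bestAction = new int[numStates];
            bestQ = new double[numStates];
            for (int s = 0; s < numStates; s++) {
                bestAction[s] = initialAction;   // action given by the initial policy
                bestQ[s] = initialQ[s];          // initial Q-value (see Table I)
            }
        }

        /** Exploit the stored best action with probability P, otherwise explore. */
        public int chooseAction(int state, double P, int numActions) {
            if (rng.nextDouble() < P) {
                return bestAction[state];
            }
            return rng.nextInt(numActions);      // explore another action
        }

        /** Lines 11-14 of Algorithm 2: keep the action only if it beats the stored best. */
        public void update(int state, int action, double reward, int nextState) {
            double qNew = reward + bestQ[nextState];   // Q(s,a) <- r + Q(s',a')
            if (qNew > bestQ[state]) {
                bestQ[state] = qNew;
                bestAction[state] = action;
            }
        }
    }

Storing two arrays of length |S| instead of an |S| × |A| matrix is what shrinks the file that has to be rewritten after each iteration of exploration.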

III. NAVIGATION EXPERIMENT

A. Lego Mindstorms EV3 Robot

The experiment implemented the SOQ algorithm in the Java language using the LeJOS implementation for the Lego Mindstorms EV3 robot [8]. Figure 3 shows that the robot has two pedrails and an infrared sensor. Each pedrail can move back and forth freely, so the robot can move forward, move backward and turn around. The infrared sensor can detect objects and track the remote infrared beacon. In the experiment, the robot's goal is to find the remote infrared beacon as fast as possible.

The detection range of the infrared sensor in beacon mode is said to be 200 cm in the EV3 guide book. In practical measurement, the valid measuring distance is about 80 cm, with the display number ranging from 0 to 100; between 80 cm and 200 cm the distance display number remains at its maximum. In fact, the distance display number and the real distance are not strictly linearly correlated, and the sensitivity becomes worse when the distance approaches 80 cm, which is a common phenomenon for a practical sensor.

To decide the remote beacon's position on the two-dimensional ground, another dimension of data besides the distance is also needed. The beacon mode provides the angle from the infrared sensor to the remote beacon, with a range between −25° and 25°. With the distance and angle data, the remote beacon can be located relative to the sensor and reached with the pedrails.

Fig. 3. EV3 robot with two pedrails and an infrared sensor.

B. Experimental Setup

The robot started about 80 cm away from the remote infrared beacon, within the −25° to 25° range. The robot had an initial policy to make sure it could find the beacon in any case; it would then explore new actions and find a faster policy. Finding the beacon was defined as the distance display number dropping below 10, which was also the final state of the experiment. Between the initial state and the final state, the algorithm discretized the distance into 91 values from 10 cm to 100 cm and the angle into 51 values from −25° to 25°, so the number of states was 91 × 51 = 4641. In each state, each pedrail had 4 action levels: action level 0 meant stop and action level 3 had the fastest moving speed. So for each state there were 4 × 4 = 16 actions, and the total number of Q-values was 4641 × 16 = 74256.
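The sketch below shows one way the 91 × 51 state grid and the 4 × 4 action set of this setup could be indexed in Java. The helper names and the exact clamping are assumptions made for illustration, not taken from the experiment's code.

    /* Sketch of the state/action indexing for 91 distance bins (10..100)
     * and 51 angle bins (-25..25). Helper names are assumptions. */
    public final class NavigationStateSpace {
        public static final int NUM_DISTANCES = 91;   // display numbers 10..100
        public static final int NUM_ANGLES = 51;      // degrees -25..25
        public static final int NUM_STATES = NUM_DISTANCES * NUM_ANGLES;   // 4641
        public static final int NUM_ACTIONS = 4 * 4;  // 4 speed levels per pedrail

        /** Maps a (distance display number, angle) reading to a state index. */
        public static int stateIndex(int distance, int angle) {
            int d = Math.min(Math.max(distance, 10), 100) - 10;   // 0..90
            int a = Math.min(Math.max(angle, -25), 25) + 25;      // 0..50
            return d * NUM_ANGLES + a;
        }

        /** Encodes a pair of pedrail speed levels (0..3 each) as one action index. */
        public static int actionIndex(int leftLevel, int rightLevel) {
            return leftLevel * 4 + rightLevel;
        }

        private NavigationStateSpace() { }
    }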

The initial policy depended on the angle. If the absolute angle was less than 5°, the robot would move forward with both pedrails at speed level 2. If the absolute angle was between 5° and 15°, the robot would turn toward the middle with one pedrail at speed level 2 and the other at level 1. And if the absolute angle was bigger than 15°, the speed levels would be 2 and 0. This policy could make the robot find the remote infrared beacon at a medium speed; a bad learning algorithm could make the robot slower.

In the control program, the state and the output actions were updated every 0.1 second using a timing interrupt. After every action, the immediate reward was R_a(s, s') = −0.01, which means the robot was punished by −0.01 every 0.1 second. If the robot found the beacon, it received a reward of 1. With the initial policy, the robot could finish the experiment in about 7 to 8 seconds. Based on this fact, all initial Q-values could be decided from the displayed distance numbers.

TABLE I. INITIAL Q-VALUES BY DISTANCE

    distance   11     12     ...    99     100
    Q-value    0.99   0.98   ...    0.11   0.10

In Table I, the initial definition of the Q-value means that the robot would move toward the beacon by a distance of 1 every 0.1 second. It can be seen as prior knowledge given to the robot to make the algorithm converge faster; a random initial Q-value setting would also work. This initialization was a little slower than the initial policy, so whenever the robot went through a state, its Q-value would be changed to be slightly bigger than the initial value. This makes it possible to find out which states the robot has reached.¹

¹ Source code of the SOQ algorithm can be found at: https://github.com/xkcoke/Simplified-Online-Q-learning.
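As an illustration, one formula that reproduces the initialization in Table I (distance 11 → 0.99, ..., distance 100 → 0.10) is Q0(d) = (110 − d)/100. The snippet below is a hedged sketch of such an initialization; the class and method names are assumptions.

    /* Sketch of the Table I initialization: Q0(d) = (110 - d) / 100 for
     * distance display numbers 11..100. Names are assumptions. */
    public final class InitialQValues {
        /** Returns the initial Q-value for a given distance display number. */
        public static double initialQ(int distanceDisplay) {
            return (110 - distanceDisplay) / 100.0;   // 11 -> 0.99, 100 -> 0.10
        }

        /** Builds the table of initial Q-values, indexed by display number. */
        public static double[] buildTable() {
            double[] q = new double[101];
            for (int d = 11; d <= 100; d++) {         // entries 11..100 as in Table I
                q[d] = initialQ(d);
            }
            return q;
        }

        private InitialQValues() { }
    }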

IV. EXPERIMENTAL RESULTS

In the experiment, the exploration-exploitation trade-off probability was set to 0.94, meaning the robot explores a new action with probability 0.06; this makes sure the robot is not so aggressive that it loses its direction. After 50 iterations of the exploration experiment, the robot had updated 998 Q-values, which was 21.7% of the total Q-values. Over 10 runs of the contrast experiment, the reinforced policy was about 4% faster than the original policy.
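The fragment below sketches how the pieces described above might be wired together in the 0.1-second control loop, reusing the SimplifiedOnlineQ and NavigationStateSpace sketches from the earlier sections. The Beacon interface and the constants are illustrative assumptions and are not LeJOS API calls.

    /* Hedged sketch of the 0.1 s control loop with P = 0.94 and the
     * -0.01 step reward. Beacon and constants are assumptions. */
    public class NavigationLoop {
        interface Beacon {                    // assumed wrapper around the IR sensor
            int distance();                   // display number 0..100
            int angle();                      // degrees -25..25
        }

        private static final double P_EXPLOIT = 0.94;    // exploit with probability 0.94
        private static final double STEP_REWARD = -0.01; // punishment every 0.1 second
        private static final double GOAL_REWARD = 1.0;   // reward for finding the beacon

        public void run(SimplifiedOnlineQ table, Beacon beacon) throws InterruptedException {
            int prevState = -1, prevAction = -1;
            while (true) {
                int state = NavigationStateSpace.stateIndex(beacon.distance(), beacon.angle());
                boolean done = beacon.distance() < 10;    // terminal condition
                if (prevState >= 0) {
                    double reward = done ? GOAL_REWARD : STEP_REWARD;
                    table.update(prevState, prevAction, reward, state);
                }
                if (done) {
                    break;
                }
                int action = table.chooseAction(state, P_EXPLOIT, NavigationStateSpace.NUM_ACTIONS);
                // applying the action would set the two pedrail speed levels here
                prevState = state;
                prevAction = action;
                Thread.sleep(100);                        // 0.1 second control period
            }
        }
    }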

Fig. 4. Q-value table displayed on the EV3 LCD.

In figure 4, a Q-value block includes two kinds of data separated by a comma: the first is the best action in that state and the second is the Q-value of that action. For example, in the red block, "23" means the left pedrail's speed was level 2 and the right pedrail's speed was level 3. The yellow blocks hold initial actions and Q-values; they mark states that have not been reached by the robot. The pink block has the initial action but a new Q-value, meaning the robot has reached that state with an initial action. The red block has a new action and a new Q-value, meaning the robot has reached that state with a new action. Compared to the original Q-learning approach, the SOQ algorithm stores only one Q-value and one best action in each state.
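Since each state's entry is just a best action and its Q-value, displayed as an "action,Q-value" pair in figure 4, the compact table can be written to storage as one short record per state. The sketch below shows one plausible serialization in Java; the file path and record format are assumptions, since the paper does not specify them.

    /* Sketch of writing the compact SOQ table to a file, one "action,Qvalue"
     * record per state. Path and formatting are assumptions. */
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;

    public final class QTableFile {
        /** Writes one line per state: best action and its Q-value, comma separated. */
        public static void save(String path, int[] bestAction, double[] bestQ) throws IOException {
            try (PrintWriter out = new PrintWriter(new FileWriter(path))) {
                for (int s = 0; s < bestAction.length; s++) {
                    out.print(bestAction[s]);
                    out.print(',');
                    out.println(bestQ[s]);
                }
            }
        }

        private QTableFile() { }
    }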

Fig. 5. Exploration experiment result: (a) new rewards, (b) new actions. The figure shows the increase in new rewards and new actions found by the SOQ algorithm during the experiment.

Figure 5 shows the exploration iterations of the robot. The new rewards in subfigure 5(a) correspond to the second data field in figure 4; they count the states that the robot has explored and whose Q-values have been modified. The new actions in subfigure 5(b) correspond to the first data field in figure 4; they count the new, better actions found by the robot. The remote infrared beacon was placed to the left in front of the robot, about 80 cm and 20° away. Although the initial conditions were almost the same in each exploration, the robot was able to find states it had never reached before, and it was also able to find better actions in most iterations.

To test the result of learning, the reinforced policy and the initial policy were compared on exactly the same navigation task. Figure 6 shows the result of 10 runs of the contrast experiment.

Fig. 6. Contrast experiment. The blue line shows the initial policy's performance and the orange line shows the reinforced policy's performance.

Although the improvement is limited, the new policy is clearly better than the original one. It still has potential, because the RL process is far from convergence after 50 iterations of the exploration experiment. In the contrast experiment, the robot always trembled when running the reinforced policy. The reason is that the new actions are discrete, like many gene mutations scattered through the Q-value table; as the exploration experiment approaches convergence, the robot's behavior will become more continuous.

V. CONCLUSION

The simplified discrete state-action value table implementation of Q-learning, which keeps one optimum action and its Q-value per state, reduced the file writing cost of online Q-learning running on the Lego Mindstorms EV3 robot. Unlike related works, which had no more than 20 states and 10 actions in their algorithms, the SOQ algorithm can deal much faster with thousands of state-action values.

The contrast experiment showed that the SOQ approach can improve the navigation performance of the robot. When a robotic task requires a large space of states and actions, memorizing the best actions instead of memorizing every action can be one possible way to accelerate a learning algorithm.

ACKNOWLEDGMENT

This work is supported by the National Natural Science Foundation of China (No. 61202218).

REFERENCES

[1] Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 993–1000, 2009.
[2] Lego Mindstorms. http://mindstorms.lego.com/. Retrieved July 18, 2015.
[3] Y. Wicaksono, R. Khoswanto, and S. Kuswadi. Behaviors coordination and learning on autonomous navigation of physical robot. Telkomnika, 9(3):473–482, 2011.
[4] Víctor Ricardo Cruz-Álvarez, Enrique Hidalgo-Peña, and Héctor-Gabriel Acosta-Mesa. A line follower robot implementation using Lego Mindstorms kit and Q-learning. Acta Universitaria, 22(0), 2012.
[5] Gabriel J. Ferrer. Encoding robotic sensor states for Q-learning using the self-organizing map. Journal of Computing Sciences in Colleges, 25:133–139, 2010.
[6] Ángel Martínez-Tenor, Juan-Antonio Fernández-Madrigal, and Ana Cruz-Martín. Lego Mindstorms NXT and Q-learning: a teaching approach for robotics in engineering. 2014.
[7] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 2009.
[8] LeJOS, Java for Lego Mindstorms. http://lejos.sourceforge.net/. Retrieved July 18, 2015.