
Deep Reinforcement Learning in Inventory Management

Kevin Geevers Master Thesis

Industrial Engineering and Management

University of Twente

Deep Reinforcement Learning in Inventory Management

By Kevin Geevers

Supervisors University of Twente:
dr.ir. M.R.K. Mes
dr. E. Topan

Supervisor ORTEC:
L.V. van Hezewijk, MSc.

December 2020

Management Summary

This research is conducted at ORTEC in Zoetermeer. ORTEC is a consultancy firm that is specialized in, amongst other things, routing and data science. ORTEC advises its customers on optimization opportunities for their inventory systems. These recommendations are often based on heuristics or mathematical models. The main limitation of these methods is that they are very case specific and therefore have to be tailor-made for every customer. Lately, reinforcement learning has gained ORTEC's interest.

Reinforcement learning is a method that aims to maximize a reward by interacting with an environment by means of actions. When an action is performed, the state of the environment is updated and a reward is given. Reinforcement learning maximizes the expected future reward and, because of its sequential decision making, is a promising method for inventory management.

ORTEC is interested in how reinforcement learning can be used in a multi-echelon inventory system and has defined a customer case on which reinforcement learning can be applied: the multi-echelon inventory system of the CardBoard Company. The CardBoard Company currently has too much stock, but is still not able to meet their target fill rate. Therefore, they are looking for a suitable inventory policy that can reduce their inventory costs. This leads to the following research question:

In what way, and to what degree, can a reinforcement learning method be best applied to the multi-echelon inventory system of the CardBoard Company, and how can this model be generalized?

Method

To answer this question, we first apply reinforcement learning to two toy problems: a linear and a divergent inventory system from the literature, which are easier to solve and implement. This way, we are able to test our method and compare it with the literature. We first implement the reinforcement learning method of Chaharsooghi, Heydari, and Zegordi (2008). This method uses Q-learning with a Q-table to determine the optimal order quantities. Our implementation of this method achieves the same results as the paper. However, when taking a closer look at this method, we conclude that it does not succeed in learning the correct values for the state-action pairs, as the problem is too large. We propose three improvements to the algorithm. After implementing these improvements, we see that the method is able to learn the correct Q-values, but does not yield better results than the unimproved Q-learning algorithm. After experimenting with random actions, we conclude that Chaharsooghi et al. (2008) did not succeed in building a successful reinforcement learning method, but only obtained promising results because of their small action space and, therefore, the limited impact of the method. Furthermore, we notice that, with over 60 million cells, the Q-table is already immense and our computer is not able to initialize a larger one. Therefore, we decide to move to another method: deep reinforcement learning (DRL).

DRL uses a neural network to estimate the value function, instead of a table. Hence, this method is more scalable and is often described as promising in the literature. We choose to implement the Proximal Policy Optimization (PPO) algorithm of Schulman, Wolski, Dhariwal, Radford, and Klimov (2017), by adapting code from the packages 'Stable Baselines' and 'Spinning Up'. We use the same hyperparameters as Schulman et al. (2017) and define the case-specific action and state space ourselves. In addition, we decide to use a neural network with a continuous action space. This means that the neural network does not output the probability of a certain action, but outputs the value of the action itself. In our case, this value corresponds to the order quantity that has to be ordered. We choose this continuous action space as it is more scalable and can be used on large action spaces.
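To make the continuous-action setup concrete, the sketch below attaches a PPO agent to a toy inventory environment with a Box (continuous) action space, where each action component is read as an order quantity. It uses the stable-baselines3 API rather than the exact 'Stable Baselines'/'Spinning Up' code adapted in this thesis, and the environment, bounds and cost figures are placeholders, not the ones defined in Chapters 4 to 6.

```python
# Hypothetical sketch: PPO with a continuous (Box) action space, where each
# action component is interpreted as an order quantity. The environment below
# is a placeholder, not the simulation used in this thesis.
import gym
import numpy as np
from gym import spaces
from stable_baselines3 import PPO


class ToyInventoryEnv(gym.Env):
    """Minimal placeholder environment with continuous order quantities."""

    def __init__(self, n_stock_points=2, max_order=30, horizon=35):
        super().__init__()
        self.n = n_stock_points
        self.horizon = horizon
        # One order quantity per stock point, bounded by a maximum order size.
        self.action_space = spaces.Box(low=0.0, high=max_order,
                                       shape=(self.n,), dtype=np.float32)
        # Observation: inventory position per stock point (illustrative bounds).
        self.observation_space = spaces.Box(low=-1000.0, high=1000.0,
                                            shape=(self.n,), dtype=np.float32)

    def reset(self):
        self.t = 0
        self.inventory = np.zeros(self.n, dtype=np.float32)
        return self.inventory.copy()

    def step(self, action):
        self.t += 1
        demand = np.random.poisson(10, size=self.n)
        self.inventory += np.round(action) - demand
        holding = np.maximum(self.inventory, 0).sum()
        backorder = np.maximum(-self.inventory, 0).sum()
        # Placeholder cost parameters; the reward is the negative total cost.
        reward = -float(1.0 * holding + 19.0 * backorder)
        done = self.t >= self.horizon
        return self.inventory.copy(), reward, done, {}


model = PPO("MlpPolicy", ToyInventoryEnv(), verbose=0)
model.learn(total_timesteps=50_000)
```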

Results

With our DRL method, we improve on the results of Chaharsooghi et al. (2008). We also apply the DRL method to a divergent inventory system, defined by Kunnumkal and Topaloglu (2011). We apply the method without modifying the parameters of the algorithm, but notice that the definition of the state and action vector is important for the performance of the algorithm. In order to compare our results, we implement the heuristic of Rong, Atan, and Snyder (2017), which determines near-optimal base-stock parameters for divergent supply chains. We run several experiments and see that our deep reinforcement learning method performs slightly better than the benchmark.

After two successful implementations of our DRL method, we apply our method to the case of the CardBoard Company. We make some assumptions and simplifications, for example regarding the demand distribution and lead times, to be able to implement the case in our simulation. We define several experiments in order to find suitable values for the upper bounds of the state and action vector. As a benchmark, we reconstruct the current method of CBC. In this method, CBC wants to achieve a fill rate of 98% for every location. Therefore, we determine the base-stock parameters in such a way that a fill rate of 98% is achieved for every location. In this case, we notice that the results of the method vary greatly. Some runs perform better than the benchmark, while five out of ten runs are not able to learn correct order quantities. We report the results of both the best and the worst run in the final results, as they are too far apart to give a representative average. The final results are:

Total costs of the DRL method and the corresponding benchmarks. Lower is better.

Case         DRL                  Benchmark
Beer game    2,726                3,259
Divergent    3,724                4,059
CBC          8,402 - 1,252,400    10,467

Conclusion and Recommendations

This thesis shows that deep reinforcement learning can successfully be implemented in different cases. To the best of our knowledge, it is the first research that applies a neural network with a continuous action space to the domain of inventory management. Next to that, we apply DRL to a general inventory system, a case that has not been considered before. We are able to perform better than the benchmark in every case. However, for the general inventory system, this result is only obtained in three out of ten runs.

To conclude, we recommend ORTEC to:

• Not start using deep reinforcement learning as the main solution for customers yet. Rather, use the method on the side, to validate how the method performs in comparison with various other cases.
• Focus on the explainability of the method. By default, it can be unclear why the DRL method chooses a certain action. However, we show that there are several ways to gain insight into the method, although these are often case specific.
• Look for ways to reduce the complexity of the environment and of the deep reinforcement learning method.
• Keep a close eye on the developments in the deep reinforcement learning field.

Preface

While writing this preface, I realize that this thesis marks the end of my student life. It has been a while since I started studying at the University of Twente. Moving to Enschede was a really big step and a bit scary. Looking back, the years flew by. Years in which I have made amazing friendships and had the opportunity to do amazing things.

I had the pleasure to write my thesis at ORTEC. At ORTEC, I had a warm welcome and learned a lot of interesting things. I would like to thank Lotte for this wonderful thesis subject and the supervision of my project. Your support and advice really helped me in my research. Without you, I definitely would not have been able to yield these good results. Next to that, I also want to thank the other colleagues at ORTEC. It was nice working with such passionate and intelligent people. I am very happy that my next step will be to start as a consultant at ORTEC!

Furthermore, I would like to thank Martijn for being my first supervisor. Your knowledge really helped to shape this research. Thanks to your critical input and feedback, the quality of this thesis has increased a lot. I would also like to thank Engin for being my second supervisor. Your knowledge of inventory management really helped to implement the benchmarks and your enthusiasm helped to finish the last part of this research.

Lastly, I would like to thank everyone who made studying a real pleasure. Special thanks to my fellow do-group, my roommates, my dispuut and my fellow board members. Thanks to you I have had a great time. Finally, I would like to thank my family and girlfriend for their unconditional support and love. I hope you will enjoy reading this as much as I have enjoyed writing it!

Kevin Geevers Utrecht, December 2020


Contents

Management Summary
Preface
List of Figures
List of Tables
1 Introduction
  1.1 Context
  1.2 Problem Description
  1.3 Case Description
  1.4 Research Goal
  1.5 Research Questions
2 Current Situation
  2.1 Classification Method
    2.1.1 Classification string
  2.2 Classification of CBC
    2.2.1 Network specification
    2.2.2 Resource specification
    2.2.3 Market specification
    2.2.4 Control specification
    2.2.5 Performance specification
    2.2.6 Scientific aspects
    2.2.7 Classification string
  2.3 Current method
  2.4 Conclusion
3 Literature Review
  3.1 Reinforcement Learning
    3.1.1 Elements of Reinforcement Learning
    3.1.2 Markov Decision Process
    3.1.3 Bellman Equation
    3.1.4 Approximate Dynamic Programming
    3.1.5 Temporal Difference Learning
  3.2 Deep Reinforcement Learning
    3.2.1 Neural Networks
    3.2.2 Algorithms
  3.3 Multi-echelon Inventory Management
    3.3.1 Heuristics
    3.3.2 Markov Decision Process
  3.4 Reinforcement Learning in Inventory Management
  3.5 Reinforcement Learning in Practice
  3.6 Conclusion
4 A Linear Supply Chain
  4.1 Case description
  4.2 State variable
  4.3 Action variable
  4.4 Reward function
  4.5 Value function
  4.6 Q-learning algorithm
  4.7 Modeling the simulation
  4.8 Implementing Q-learning
  4.9 Implementing Deep Reinforcement Learning
  4.10 Conclusion
5 A Divergent Supply Chain
  5.1 Case description
  5.2 State variable
  5.3 Action variable
  5.4 Reward function
  5.5 Value function
  5.6 Adapting the simulation
  5.7 Implementing a benchmark
  5.8 Implementing Deep Reinforcement Learning
  5.9 Conclusion
6 The CardBoard Company
  6.1 Case description
  6.2 State variable
  6.3 Action variable
  6.4 Reward function
  6.5 Value function
  6.6 Benchmark
  6.7 Implementing Deep Reinforcement Learning
  6.8 Practical implications
  6.9 Conclusion
7 Conclusion and Recommendations
  7.1 Conclusion
    7.1.1 Scientific contribution
    7.1.2 Limitations
  7.2 Recommendations
  7.3 Future research
Bibliography
A Simulation Environment
B Beer Game - Q-learning Code
C Beer Game - Q-learning Experiment
D Beer Game - Q-Values
E Beer Game - DRL Code
F Divergent - DRL Code
G Divergent - Pseudo-code
H Divergent - Heuristic
I CBC - DRL Code


List of Figures

1.1 An overview of types and applications of machine learning. Adapted from Krzyk (2018).
1.2 Elements of a Reinforcement Learning system.
1.3 An overview of CBC's supply chain.
1.4 Overview of the research framework.
2.1 An overview of CBC's supply chain.
2.2 Order-up-to levels for the stock points of CBC.
3.1 Transition graph of a finite MDP (Sutton & Barto, 2018).
3.2 The cliff-walking problem (Sutton & Barto, 2018).
3.3 A visualization of a neural network.
3.4 A neuron inside a neural network.
3.5 A visualization of the differences between a Q-table and a Q-network. Adapted from Gemmink (2019).
3.6 The Actor-Critic architecture (Sutton & Barto, 2018).
4.1 Supply chain model of the beer game from Chaharsooghi, Heydari, and Zegordi (2008).
4.2 Exploitation rate per period.
4.3 Visualization of the Beer Game simulation.
4.4 Costs of the Q-learning algorithm per dataset. Lower is better.
4.5 Costs over time per dataset. Closer to zero is better.
4.6 Comparison of the results from the paper and our results. Lower is better.
4.7 Highest Q-value of the first state per iteration.
4.8 Results of the revised Q-learning algorithm.
4.9 Highest Q-value of the first state per iteration.
4.10 Costs using the RLOM and random actions. Lower is better.
4.11 Costs of the PPO algorithm on the beer game. Lower is better.
5.1 Divergent supply chain.
5.2 Results of the PPO algorithm on the divergent supply chain. Lower is better.
5.3 Actions of the warehouse.
5.4 Actions of the retailers.
6.1 An overview of the supply chain of CBC, along with the probabilities of the connections between the paper mills and corrugated plants.
6.2 Results of the PPO algorithm per experiment. Lower is better.
6.3 Results of the PPO algorithm on the case of CBC. Lower is better.
6.4 Results of the PPO algorithm on the case of CBC of the 5 best runs. Lower is better.
6.5 Costs over time for the case of CBC. Closer to zero is better.
6.6 Average fill rates for the stock points of CBC.
6.7 Distribution of the scaled actions.
C.1 Highest Q-value of the first state per iteration.


List of Tables

2.1 Classification concept (1/3), (Van Santen, 2019)
2.1 Classification concept (2/3), (Van Santen, 2019)
2.1 Classification concept (3/3), (Van Santen, 2019)
3.1 Classification elements and values
3.2 Usage of reinforcement learning in inventory management. (1/2)
3.2 Usage of reinforcement learning in inventory management. (2/2)
4.1 Coding of the system state, extracted from Chaharsooghi, Heydari, and Zegordi (2008)
4.2 Four test problems, extracted from Chaharsooghi, Heydari, and Zegordi (2008)
4.3 Upper bound of the state variables for the beer game.
5.1 Upper bound of the state variables for the divergent supply chain.
6.1 Experiments for defining the upper bounds of the state and action vector for the CBC case.
7.1 Total costs of the DRL method and the corresponding benchmarks. Lower is better.
D.1 Q-values of the first state (10 replications).


1. Introduction

This thesis is the result of the research performed at ORTEC in Zoetermeer, in order to develop a method to optimize the inventory systems of their clients. In this chapter, we first introduce ORTEC in Section 1.1. Section 1.2 covers the problem description, followed by a description of the case we consider in Section 1.3. With this case description, we determine the scope and goal of this research, which is given in Section 1.4. To conclude, Section 1.5 describes the research questions and the research framework.

1.1 Context

ORTEC is a company that is specialized in analytics and optimization. The company was founded in 1981 by five students who wanted to show the world how mathematics can be used for sustainable growth in companies and society. Since then, they have become the world's leading supplier of optimization software and advanced analytics. While ORTEC started with building optimization software - also referred to as ORTEC Products - they have also set up a consultancy business unit. This business unit focuses on, amongst other things, supply chain design, revenue management, data science, and forecasting. At the moment, ORTEC has about 1,000 employees working in 13 countries, of which around 200 employees are working for ORTEC Consulting.

One of the departments within ORTEC Consulting is the Center of Excellence (CoE). This department is the place for gathering and centralizing knowledge. Next to gathering existing knowledge within ORTEC Consulting, the CoE also looks for interesting subjects to expand their knowledge. This is done by several teams within the CoE with their own field of expertise, such as the Supply Chain team. Within this team, a research project about Multi-Echelon Inventory Optimization (MEIO) was recently finished, which will be used in this research. This research is initiated to elaborate on the subject of MEIO in combination with machine learning.

1.2 Problem Description

Inventory is usually kept in order to respond to fluctuations in demand and supply. With more inventory, a higher service level can be achieved, but the inventory costs also increase. To find the right balance between the service level and inventory costs, inventory management is needed. Inventory usually accounts for 20 to 60 percent of the total assets of manufacturing firms. Therefore, inventory management policies prove critical in determining the profit of such firms (Arnold, Chapman, & Clive, 2008). The current inventory management projects of ORTEC Products are focused on Vendor Managed Inventory (VMI). VMI is used for the inventory replenishment of retailers, which delegate their inventory replenishment decisions to the supplier. VMI is usually proposed to resolve the problem of exaggerated orders from retailers (Chopra & Meindl, 2015; Kwon, Kim, Jun, & Lee, 2008). In the future, ORTEC Consulting also wants to do more projects on inventory management, but is interested in the tactical aspects, such as high-level inventory planning. They want to solve a variety of inventory problems and do not want to focus solely on VMI. Hence, they are looking for promising methods to solve different kinds of inventory problems. They are already familiar with classical approaches such as inventory policies and heuristics, but want to experiment with data-driven methods, like machine learning.

Most companies are already using certain inventory policies to manage their inventory. With these policies, they determine how much to order at a certain point in time, as well as how to maintain appropriate stock levels to avoid shortages. Important factors to keep in mind in these policies are the current stock level, forecasted demand and lead time (Axsäter, 2015). These policies often focus on a single location and only use local information, which results in individually optimized local inventories that do not benefit the supply chain as a whole. The reason for not expanding the scope of the policies is the lack of sufficient data and the growing complexity of the policies. Due to recent IT developments, it has become easier to exchange information between stocking points, resulting in more usable data. This has contributed to an increasing interest in Multi-Echelon Inventory Optimization. Studies show that these multi-echelon inventory systems are superior to single-echelon policies, as the coordination among inventory policies can reduce the ripple effect on demand (Giannoccaro & Pontrandolfo, 2002; Hausman & Erkip, 1994). Despite the promising results, a lot of companies are still optimizing individual locations (Jiang & Sheng, 2009). Therefore, ORTEC sees promising opportunities in optimization methods for multi-echelon inventory management.

Next to that, ORTEC is curious about methods outside of the classical Operation Research domain, because of the limitations of the current methods. Mathematical models for inventory management can quickly become too complex and time-consuming, which results in an unmanageable model (Gijsbrechts, Boute, Van Mieghem, & Zhang, 2019). To prevent this, the models usually rely heavily on assumptions and simplifications (Jiang & Sheng, 2009), which makes it harder to relate the models to real-world problems and to be put into practice. Another way of reducing the complexity and solving time is the use of heuristics. Unfortunately, these heuristic policies are typically problem-dependent and still rely on assumptions, which limits their use in different settings (Gijsbrechts et al., 2019). Although

Figure 1.1: An overview of types and applications of machine learning. Adapted from Krzyk (2018). The diagram groups example applications under supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction) and reinforcement learning (real-time decisions, game AI, robot navigation, skill acquisition, learning tasks).


Although these mathematical models and heuristics can deliver good results for their specific setting, ORTEC is, because of its consultancy perspective, especially interested in a method that can be used for various supply chains without many modifications. For such a holistic approach, reinforcement learning is a promising method that may cope with this complexity (Topan, Eruguz, Ma, Van Der Heijden, & Dekker, 2020).

In machine learning, we usually make a distinction between three types: unsupervised, supervised and reinforcement learning. Every type of machine learning has its own approach and type of problem that it is intended to solve. Figure 1.1 shows these three types of machine learning, their areas of expertise and possible applications.

At the moment, ORTEC uses machine learning in several projects and has quite some experience with it. Most of these projects are about forecasting and therefore use supervised machine learning. During these projects, their experience with and interest in machine learning grew, and they became interested in other applications of this technique. Especially reinforcement learning aroused great interest within ORTEC, because it can be used to solve a variety of problems and is based on a mathematical framework: the Markov Decision Process (MDP). MDPs are often used in Operations Research and are therefore well known to ORTEC. We further explain MDPs in Section 3.1.2.

Reinforcement learning is about learning what to do to maximize a reward (Sutton & Barto, 2018). A reinforcement learning model can be visualized as in Figure 1.2. In this model, an agent interacts with the environment by means of an action. When this action is done, the current state of the environment is updated and a reward is awarded. Reinforcement learning focuses on finding a balance between exploration and exploitation (Kaelbling, Littman, & Moore, 1996). When exploring, the reinforcement learning agent selects random actions to discover their rewards, while exploitation means selecting actions based on its current knowledge. Another important aspect of reinforcement learning is its ability to cope with delayed rewards, meaning that the system will not necessarily go for the action with the highest immediate reward, but will try to achieve the highest reward overall. These features make reinforcement learning a good method for decision making under uncertainty.

Figure 1.2: Elements of a Reinforcement Learning system. The agent performs action At on the environment, which returns the next state St+1 and reward Rt+1.

A well-known application of reinforcement learning is in the field of gaming. For example, if we were to use reinforcement learning in the game of Pac-Man, our agent would be Pac-Man itself and the environment would be the maze. The action that the agent can take is moving in a certain direction. When Pac-Man has moved in a direction, the environment is updated with this action. A reward is then awarded to Pac-Man; this can either be a positive reward, whenever Pac-Man has eaten a dot, or a negative reward, whenever Pac-Man is eaten by a ghost. However, when Pac-Man first eats the big flashing dot and thereafter eats a ghost, he gets the biggest possible reward. This is an example of a delayed reward. The reinforcement learning system learns the game by playing it. Through exploration, it learns the rewards that are awarded to certain actions in certain states. When the system plays the game long enough, we get a reinforcement learning system that excels at playing Pac-Man.
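To make the interaction loop of Figure 1.2 and the exploration-exploitation trade-off concrete, the sketch below shows a minimal, self-contained agent-environment loop with epsilon-greedy action selection. The state, actions, rewards and environment dynamics are made up for this illustration and are not the definitions used later in this thesis.

```python
# Illustrative agent-environment loop with epsilon-greedy exploration.
# Both the environment dynamics and the reward are made up for this example.
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]          # e.g., order nothing, order 1 unit, order 2 units
EPSILON = 0.1                # probability of exploring a random action


def env_step(state, action):
    """Toy environment: returns (next_state, reward)."""
    demand = random.randint(0, 2)
    next_state = max(state + action - demand, 0)             # on-hand inventory
    reward = -(next_state + 5 * max(demand - state - action, 0))
    return next_state, reward


def choose_action(q_values, state):
    """Explore with probability EPSILON, otherwise exploit current knowledge."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                         # exploration
    return max(ACTIONS, key=lambda a: q_values[(state, a)])   # exploitation


q_values = defaultdict(float)   # the agent's current knowledge of state-action values
state = 0
for t in range(100):
    action = choose_action(q_values, state)
    next_state, reward = env_step(state, action)
    # A learning rule (e.g., Q-learning, Section 3.1.5) would update q_values here.
    state = next_state
```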


Reinforcement learning is developed to solve sequential decision making problems in dynamic environments (Sutton & Barto, 2018). In sequential decision making, a series of decisions is to be made in interaction with a dynamic environment to maximize the overall reward (Shin & Lee, 2019). This makes reinforcement learning an interesting method for inventory management. By being able to handle dynamic environments, it can be possible to create a variable inventory policy that depends on the current state of the system, instead of a fixed order policy. Sequential decision making and delayed rewards are also relevant for inventory management. In inventory management, actions need to be taken, but the consequences of these decisions are not always directly visible. For example, when a company chooses not to replenish at a certain moment because it still has items in stock, there is no direct penalty. In a later stage, when the items are out of stock, customers cannot be served. Potential sales are lost in this case, and it turns out that the company should have chosen to replenish earlier in order to maximize its profit. Reinforcement learning is able to link a certain reward to every decision in every state and is therefore a promising method to make solid recommendations on when to replenish. Next to that, reinforcement learning could help in solving more complex situations, with more stochasticity and more variables taken into account. As a result of the earlier mentioned complexity and computation time of mathematical models, multi-item models are scarcely represented in the literature at the moment (Chaudhary, Kulshrestha, & Routroy, 2018). With the use of reinforcement learning, it might become easier to use the model for more complex situations. Also, reinforcement learning can include the stochasticity of the demand, whereas, at the moment, only a few models take this into account, as it is really hard to deal with (Chaudhary et al., 2018). Another promising aspect of reinforcement learning is that it could provide a way to solve a diversity of problems, rather than relying on extensive domain knowledge or restrictive assumptions (Gijsbrechts et al., 2019). Therefore, it could work as a general method that requires less effort to adapt to different situations, making it easier to reflect real-world situations.
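The delayed consequence of a replenishment decision can be illustrated with a small, made-up calculation: skipping replenishment costs nothing today, but the penalty shows up later as backorders. All numbers below are purely illustrative.

```python
# Toy illustration of a delayed reward in inventory management: not replenishing
# is free today, but the cost shows up later as backorders. All numbers are made up.
HOLDING_COST = 1      # per unit on hand, per period
BACKORDER_COST = 19   # per unit short, per period


def total_cost(order_decisions, demand, start_inventory=5):
    inventory, cost = start_inventory, 0
    for order, d in zip(order_decisions, demand):
        inventory += order - d
        cost += HOLDING_COST * max(inventory, 0) + BACKORDER_COST * max(-inventory, 0)
    return cost


demand = [3, 3, 3, 3]
print(total_cost([0, 0, 0, 0], demand))   # never replenish: cheap at first, 230 in total
print(total_cost([3, 3, 3, 3], demand))   # replenish every period: only 20 in holding costs
```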

The research project about Multi-Echelon Inventory Optimization of Van Santen (2019) introduced a classification concept based on De Kok et al. (2018) to describe the different characteristics of a multi-echelon supply chain. This concept is further explained in Chapter 2 and is used to describe the supply chain considered in this research.

The features of reinforcement learning sparked the interest of ORTEC and sound very promising for their future projects. This research serves as an exploration in order to find out if reinforcement learning in inventory management can live up to these expectations. In order to build and validate our model, we use the data of a company specialized in producing cardboard, a customer of ORTEC.

1.3 Case Description

The CardBoard Company (CBC) is a multinational manufacturing company that produces paper-based packaging. They are active in the USA and Europe; this case focuses on the latter. CBC has a two-echelon supply chain that consists of four paper mills, which produce paper, and five corrugated plants, which produce cardboard. Paper mills are connected to multiple corrugated plants and the other way around. An overview of this supply chain is given in Figure 1.3.

The CardBoard Company is interested in lowering their inventory costs, while maintaining a certain service level to their customers. In this research, we use the fill rate as a measure for the service level. The fill rate is the fraction of customer demand that is met through immediate stock availability, without backorders or lost sales (Axsäter, 2015; Vermorel, 2015). At the moment, CBC has a lot of inventory spread over the different stock points, yet they are not always able to reach their fill rate goal. CBC is interested in a better inventory management system that can optimize their multi-echelon supply chain.
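As a small illustration of this definition (with made-up numbers), the fill rate over a horizon can be computed as the fraction of demand that is served directly from on-hand stock:

```python
# Fill rate = fraction of demand met immediately from on-hand stock
# (backorders and lost sales do not count). Numbers are purely illustrative.
def fill_rate(demands, on_hand_at_demand):
    served_immediately = sum(min(d, stock) for d, stock in zip(demands, on_hand_at_demand))
    return served_immediately / sum(demands)


print(fill_rate(demands=[10, 8, 12], on_hand_at_demand=[10, 5, 12]))  # -> 0.9
```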


To support their supply chain planners, a suitable inventory policy will be determined. The reinforcement learning method will determine the near-optimal order sizes for every stock point, based on their inventory position. In the case of CBC, this results in a policy with which they meet the target fill rate while minimizing the total holding and backorder costs. This inventory policy will provide the supply chain planners with guidance on how much stock to keep at every stock point. In the end, the planners will have to decide for themselves whether they follow this policy or deviate from it.

Figure 1.3: An overview of CBC's supply chain, running from suppliers via paper mills and corrugated plants to customers.

1.4 Research Goal

As mentioned before, our research focuses on the usability of reinforcement learning in inventory management. Within ORTEC, inventory management projects are scarce and mostly focused on Vendor Managed Inventory. ORTEC Consulting is interested in high-level inventory planning, wants to solve a variety of inventory problems, and does not want to focus solely on VMI. Therefore, ORTEC wants to see how new methods can be used in inventory management. To make sure this method can be used by ORTEC in future projects, it is applied to the case of the CardBoard Company, a customer of ORTEC. The CardBoard Company wants to find out how their fill rate can be increased and is looking for opportunities to improve their inventory management. The aim of this research is to develop a reinforcement learning method that advises the supply chain planners of the CardBoard Company on the replenishment of the different stock points. We build a reinforcement learning method to find out whether we can optimize the inventory management. After that, we want to generalize this method, to make sure that ORTEC can also use it for other customers. This leads to the following main research question:

In what way, and to what degree, can a reinforcement learning method be best applied to the multi-echelon inventory system of the CardBoard Company, and how can this model be generalized?

In order to answer this main research question, we formulate several other research questions in the next section. These questions guide us through the research and eventually lead to answering the main question.


1.5 Research Questions

The following research questions are formulated to achieve our research goal. The questions cover different aspects of the research and consist of sub-questions, which help to structure the research. First, we gain more insight into the current situation of the CardBoard Company. We need to know how we can capture all the relevant characteristics of an inventory system and have to define these characteristics for CBC. Next to that, we have to gain insight into the current performance of CBC, such as their current inventory policy and their accomplished service level.

1. What is the current situation at the CardBoard Company?
   (a) How can we describe relevant aspects of multi-echelon inventory systems?
   (b) What are the specific characteristics of CBC?
   (c) What is the current performance for inventory management at CBC?

After that, we study relevant literature related to our research. We further elaborate on reinforcement learning and search for current methods for multi-echelon inventory management in order to find a method that we can use as a benchmark. To conclude, we look for cases in the literature where reinforcement learning is applied to multi-echelon inventory management. For these cases, we list all the relevant characteristics that we described in the previous research question. We can then see how these cases differ from CBC. With this information, we choose a qualifying method for our problem.

2. What type of reinforcement learning is most suitable for the situation of the CardBoard Company?
   (a) What types of reinforcement learning are described in literature?
   (b) What methods for multi-echelon inventory management are currently being used in literature?
   (c) What types of reinforcement learning are currently used in multi-echelon inventory management?
   (d) How are the findings of this literature review applicable to the situation of CBC?

When we have gathered the relevant information from the literature, we can begin building our reinforcement learning method. However, reinforcement learning has proven to be a difficult method to implement, and, therefore, it is recommended to implement the method on a toy problem first (Raffin et al., 2019). For this, we take two existing cases from the literature and implement these ourselves. When this reinforcement learning method proves to be working, we expand it in terms of inventory system complexity. This is done by implementing new features in such a way that we work towards the inventory system of CBC.

3. How can we build a reinforcement learning method to optimize the inventory management at CBC?
   (a) How can we build a reinforcement learning method for a clearly defined problem from literature?
   (b) How can we expand the model to reflect another clearly defined problem from literature?
   (c) How can we expand the model to reflect the situation of CBC?

While building the reinforcement learning method, we also have to evaluate our method. We will evaluate the method for every problem. Next to that, we will compare the method to see if it does perform better than the current situation of CBC and other multi-echelon methods. We can then gain insights on the performance of the model.

4. What are the insights that we can obtain from our model?
   (a) How should the performance be evaluated for the first toy problem?
   (b) How should the performance be evaluated for the second toy problem?
   (c) How should the performance be evaluated for CBC?
   (d) How well does the method perform compared to the current method and other relevant methods?

Finally, we describe how to implement this model at CBC. Next to that, we discuss how this model can be modified to be able to apply it to other multi-echelon supply chain settings. This way, ORTEC is able to use the reinforcement learning method for their future projects.

5. How can ORTEC use this reinforcement learning method?
   (a) How can the new method be implemented at CBC?
   (b) How can the reinforcement learning method be generalized to be used for inventory systems of other customers?

The outline of this report is given in Figure 1.4. This overview links the research questions to the corresponding chapters and clarifies the steps that will be taken in this research. As mentioned earlier, we will evaluate the performance of the method for every problem. Therefore, the Chapters 4, 5, and 6 cover multiple research questions.

Figure 1.4: Overview of the research framework, linking the research phases (Current Situation, questions 1a-1c, Chapter 2; Literature Review, questions 2a-2d, Chapter 3; Solution Design, questions 3a-3c; Result Analysis, questions 4a-4d; Implementation, questions 5a-5b) to the chapters in which the two toy problems and the CBC case are built, evaluated and generalized (Chapters 4, 5 and 6).


2. Current Situation

In this chapter, we further elaborate on the situation of the CardBoard Company. We first introduce the classification method that we are going to use in Section 2.1. In Section 2.2, we explain CBC's supply chain in detail and classify it. Section 2.3 contains the current method and performance of the inventory management at CBC. We end this chapter with a conclusion in Section 2.4.

2.1 Classification Method

In order to grasp all the relevant features of the supply chain, we use a typology for multi-echelon inventory systems. This typology was introduced by De Kok et al. (2018) in order to classify and review the available literature on multi-echelon inventory management under uncertain demand. With this typology, De Kok et al. (2018) want to explicitly state all important dimensions of modeling assumptions and to make it easier to link the supply chain problems of the real world to the literature. The dimensions stated by De Kok et al. (2018) are based on inventory systems used in literature. Therefore, the classification was extended by Van Santen (2019) to capture all the important aspects of real-world inventory systems. Although it was not possible to ensure that all important aspects of an inventory system are captured in this classification, Van Santen (2019) is confident that the extensions are valuable. Also, the conducted case studies show that the added dimensions were able to capture all the relevant information.

The classification consists of dimensions and features. The dimensions describe the aspects of the supply chain, such as the number of echelons and the distribution of the demand. The term features is used to define the possible values of these dimensions, for example, single echelon is a feature of the dimension number of echelons.

Table 2.1 shows all the dimensions and their features. Dimensions that are added by Van Santen (2019) are denoted with a ‘*’. Next to that, an (S) denotes a dimension or feature that needs a specification in order to simulate the system properly. For example, if a supply chain has bounded capacity, the unit of measure (e.g., kilogram, m3) needs to be specified (Van Santen, 2019). Further explanation of features that are not described in this thesis can be found in the work of Van Santen (2019).

Table 2.1: Classification concept (1/3), (Van Santen, 2019)

Network specification:
Echelons (D1): number of echelons, a rank within the supply chain network. Values: 1 (f1) Single echelon; 2 (f2) Two echelons; 3 (f3) Three echelons; 4 (f4) Four echelons; n (f5) General number of echelons (S).
Structure (S) (D2): relationship between installations. Values: S (f6) Serial network (1 predecessor, 1 successor); D (f7) Divergent network (1 predecessor, n successors); C (f8) Convergent network (n predecessors, 1 successor); G (f9) General network (n predecessors, n successors).
Time (D3): moments in time where relevant events occur. Values: D (f10) Discrete; C (f11) Continuous.


Table 2.1: Classification concept (2/3), (Van Santen, 2019)

Information (D4): level of information needed to perform the computations. Values: G (f12) Global; L (f13) Local; E (f14) Echelon.
*Products (D5): number of products considered in the inventory system. Values: 1 (f15) Single product(-categories); n (f16) Multiple product(-categories) (S).

Resource specification:
Capacity (D6): restrictions on the availability of resources at a single point in time. Values: F (f17) Bounded storage and/or processing capacity (S); I (f18) Infinite capacity.
Transportation Delay (S) (D7): time it takes to deliver an available item. Values: C (f19) Constant; E (f20) Exponential; G (f21) General stochastic; O (f22) Other.

Market specification:
Demand (S) (D8): exogenous demand distribution for an item. Values: C (f23) Deterministic; B (f24) Compound "batch" Poisson; D (f25) Discrete stochastic; G (f26) General stochastic; M (f27) Markovian; N (f28) Normal; P (f29) Poisson; R (f30) Compound renewal; U (f31) Upper-bounded.
Customer (Disservice) (D9): reactions if demand cannot be (completely) fulfilled. Values: B (f32) Backordering; G (f33) Guaranteed service; L (f34) Lost sales; V (f35) Differs per customer (S).
*Intermediate Demand (D10): defines the echelons that receive exogenous demand. Values: D (f36) Downstream echelon; M (f37) Multiple echelons (S).
*Fulfillment (D11): acceptance of partial fulfillment in case of disservice. Values: P (f38) Partial fulfillment; C (f39) Complete fulfillment; V (f40) Differs per customer (S).
*Substitution (D12): acceptance of a substitute product in case of disservice. Values: N (f41) None; I (f42) Accepts substitution of a more expensive product; D (f43) Accepts substitution of a cheaper product; V (f44) Differs per product (S).

Control specification:
Policy (S) (D13): prescribed type of replenishment policy. Values: N (f45) None; B (f46) Echelon base stock; b (f47) Installation base stock; S (f48) Echelon (s, S); s (f49) Installation (s, S); Q (f50) Echelon (s, nQ); q (f51) Installation (s, nQ); O (f52) Other.


Table 2.1: Classification concept (3/3), (Van Santen, 2019)

*Review Period (D14): moments the inventory is checked. Values: C (f53) Continuously; P (f54) Periodically (S).
Lot-Sizing (D15): constraint on the replenishment quantity. Values: F (f55) Flexible: no restriction; Q (f56) Fixed order quantity (S); O (f57) Other (S).
Operational Flexibility (S) (D16): capability to use other means of satisfying unexpected requirements than originally foreseen. Values: N (f58) None; O (f59) Outsourcing; F (f60) Fulfillment flexibility; R (f61) Routing flexibility; U (f62) Unspecified (S).
*Inventory Rationing (D17): order in which backlog is fulfilled. Values: N (f63) None; F (f64) First-Come-First-Served; M (f65) Maximum fulfillment (S); P (f66) Prioritized customers (S); O (f67) Other (S).
*Replenishment Rationing (S) (D18): order in which replenished items are divided (echelon replenishment). Values: N (f68) None; D (f69) Depends on the current inventory position per installation; I (f70) Independent of the current inventory position per installation.

Performance specification:
Performance Indicator (S) (D19): objective to be achieved as a result of the selection of the control policy and its parameters. Values: E (f71) Equilibrium; S (f72) Meeting operational service requirements; C (f73) Minimization of costs; M (f74) Multi-objective; U (f75) Unspecified.
*Costs (S) (D20): costs present in the inventory system. Values: N (f76) None; H (f77) Holding costs; R (f78) Replenishment costs; B (f79) Backorder costs.
*Service Level (S) (D21): performance measures used in the inventory system. Values: N (f80) None; A (f81) The α-SL / ready rate; B (f82) The β-SL / fill rate; G (f83) The γ-SL.

Scientific aspects:
Methodology (D22): techniques applied to achieve the results. Values: A (f84) Approximative; C (f85) Computational experiments; E (f86) Exact; F (f87) Field study; S (f88) Simulation.
Research Goal (D23): goal of the investigations. Values: C (f89) Comparison; F (f90) Formulae; O (f91) Optimization; P (f92) Performance evaluation.


2.1.1 Classification string

We now have defined all different dimensions of a multi-echelon inventory system. In order to make sure we can easily compare different classifications, we need something more structured than the description of the features. Therefore, the classification strings of Van Santen (2019) and De Kok et al. (2018) are introduced. The original string of De Kok et al. (2018) has the following structure:

<D1>,<D2>,<D3>,<D4> | <D6>,<D7> | <D8>,<D9> | <D13>,<D15>,<D16> | <D19> || <D22>,<D23>

This string is divided into two parts, separated by a ||. The first part contains the dimensions related to the inventory system and consists of five sections, separated by a |. The second part contains information about the methodology of the paper. The string will be filled with the active feature(s) of the dimensions. An inventory system can have multiple active features per dimension. These values will then all be used in the string, without being separated by a comma. An example of the classification string is shown below:

n,G,D,G|F,G|G,B|b,F,R|C||CS,O

This string only contains the dimensions introduced by De Kok et al. (2018). The dimensions that are added by Van Santen (2019) are described in a separate string. For the ease of comparison with literature and to use a widely accepted classification structure, the typology of De Kok et al. (2018) is unaltered. The second string with the dimensions of Van Santen (2019) is structured as follows:

<D5> | <...> | <D10>,<D11>,<D12> | <D14>,<D17>,<D18> | <D20>,<D21>

To distinguish the two strings, we refer to them as T1 and T2. Therefore, an example of the final classification structure that Van Santen (2019) introduces is shown below:

T1: n,G,D,G|F,G|G,B|b,F,R|C||CS,O T2: 2| |D,P,N|C,F,N|HR,B
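As a small illustration of how such a string is assembled, the sketch below builds the T1 string of De Kok et al. (2018) from a dictionary of active features per dimension. The grouping of dimensions into the five sections is inferred from the example string above and should be read as an assumption, not as a definition taken from the typology itself.

```python
# Hypothetical helper that assembles the T1 classification string of
# De Kok et al. (2018) from a dictionary of active features per dimension.
# The grouping of dimensions into sections is inferred from the example string.
T1_SECTIONS = [["D1", "D2", "D3", "D4"], ["D6", "D7"], ["D8", "D9"],
               ["D13", "D15", "D16"], ["D19"]]
T1_METHOD = ["D22", "D23"]


def build_t1(features):
    """features: dict mapping dimension id to a string of active feature codes."""
    first = "|".join(",".join(features[d] for d in section) for section in T1_SECTIONS)
    second = ",".join(features[d] for d in T1_METHOD)
    return first + "||" + second


example = {"D1": "n", "D2": "G", "D3": "D", "D4": "G", "D6": "F", "D7": "G",
           "D8": "G", "D9": "B", "D13": "b", "D15": "F", "D16": "R",
           "D19": "C", "D22": "CS", "D23": "O"}
print(build_t1(example))  # -> n,G,D,G|F,G|G,B|b,F,R|C||CS,O
```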

2.2 Classification of CBC

In this section, we elaborate on the inventory system of CBC and classify it with the introduced notation. Each subsection covers the corresponding section of the classification method. At the end of this section, we state all the features in a classification string. The information about the supply chain of CBC is gathered from Van Santen (2019) and from consultants of ORTEC who worked on this specific case.

2.2.1 Network specification

As mentioned in Section 1.3, the CardBoard Company is a multinational company that produces paper-based packaging. Their supply chain in Europe consists of two echelons. The first tier has four paper mills, which produce paper and are located in Keulen (DE), Linz (AT), Granada (ES) and Umeå (SE). The second tier produces cardboard in five corrugated plants, located in Aken (DE), Erftstadt (DE), Göttingen (DE), Valencia (ES) and Malaga (ES). Not every corrugated plant can be supplied by every paper mill, but each plant can always be supplied by at least two. The paper mills, on the other hand, can also always supply more than one corrugated plant. Because of these multiple connections, this supply chain is a General Network. Figure 2.1 shows the multi-echelon supply chain and its connections between the different stock points.

Figure 2.1: An overview of CBC's supply chain: suppliers, the four paper mills (Keulen, Linz, Granada, Umeå), the five corrugated plants (Aken, Erftstadt, Göttingen, Valencia, Malaga) and the customers.

In order to model the situation of CBC, we use a simulation in which time is discrete. This means that each event occurs at a particular instant in time and marks a change of state in the system. Between consecutive events, no change in the system is assumed to occur (Robinson, 2004). The information level is global, because CBC has a Supply Chain department that communicates with all the installations.

In this case, we only consider the products that CBC can produce in the corrugated plants in scope. Hence, the products in our model are the paper types that are used to make the cardboard. These cardboards can differ in color, weight and size. In total, 281 unique product types can be produced by these corrugated plants. However, not every plant can produce every product type, and there are types that can be produced in multiple corrugated plants. This results in a total of 415 product type - corrugated plant combinations.

2.2.2 Resource specification

Every stock point in the supply chain has a bounded capacity. In this case, the capacity is known for every location and expressed in kg. This measure is used because the products are stored as paper rolls that can differ in weight and width. The weight generally increases linearly with the amount of paper. Because there may be situations where this relation is not (exactly) linear, we take a margin of 5% of the capacity per location. With this, we make sure that the solution will be feasible regarding the capacity and we can also account for capacity situations that are out of scope, for example, the size of the pallets. In the real situation of CBC, the capacity of several stock points can be expanded by the use of external warehouses, however, we consider these warehouses out of scope for our case.

A lead time is considered for the transport between the paper mills and the corrugated plants. We use a constant transportation time, but it is different for every connection between these locations. This constant is determined by taking the average of the observed lead times for every connection. These observed lead times did not vary by more than 30 minutes on lengths of multiple hours. Therefore, we assume that the variation in transportation time is negligible.

2.2.3 Market specification

For every corrugated plant, the historical demand is stored in the ERP system. The demand data is stored on an order level per calendar day. Hence, we cannot distinguish the original sequence in which orders arrived on the same day. The data we have is limited to one year. If we were to use this data directly in our model, we would be likely to overfit. This happens when the model corresponds too closely to a limited set of data points and may fail to correctly predict future observations (Kenton, 2019). In order to prevent overfitting the data in our model, we fit distributions over this data. Because we fit these distributions at a later stage, we classify the demand for now as general stochastic.
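A minimal sketch of this step is shown below, assuming SciPy is used for the fitting; the gamma distribution and the synthetic data are placeholders, as this chapter does not specify which distributions or tooling are used.

```python
# Illustrative sketch: fit a candidate distribution to one year of daily demand
# for a single product-plant combination, instead of replaying the raw data.
import numpy as np
from scipy import stats

daily_demand = np.random.gamma(shape=2.0, scale=50.0, size=365)  # placeholder for ERP data

shape, loc, scale = stats.gamma.fit(daily_demand, floc=0)
fitted = stats.gamma(shape, loc=loc, scale=scale)

# The simulation can then sample fresh demand from the fitted distribution.
simulated_demand = fitted.rvs(size=50)
```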

In the real-world situation of CBC, paper mills deliver not only to the corrugated plants, but also have other customers that can order directly at the paper mills. When stock points can receive orders from customers outside the considered supply chain, this is called intermediate demand.

When the demand exceeds the current inventory and customers, therefore, cannot be fulfilled immediately, orders are placed in the backlog. This means that these orders will be fulfilled whenever the location is replenished. In this case, it is also possible that customers are partially fulfilled, which happens whenever the replenishment was not enough to fulfill all the orders completely. The remaining items in such an order are then backlogged. To make sure that partial fulfillment does not lead to small orders and transport costs that are too high, the locations consider a Minimum Order Quantity (MOQ), which holds for every order. In practice, whenever a product type is not available, the customer can also opt for substitution of the product, meaning that another, mostly more expensive, product is delivered instead. In this case, the substitution product type can have a higher grade or grammage, but this differs per product type. When a substitution product is chosen, the demand for the original product is lost.

2.2.4 Control specification

In the current situation, the inventory of CBC is controlled by the Supply Chain department. They use an ERP system that is configured with an (s,S) policy. In this policy, s is the safety stock and S is the order-up-to level. When ordering the replenishment for the stock points, the ERP system gives a recommendation, but the Supply Chain department finally decides on the order size and timing. This ERP system checks the inventory at the end of every day; hence, the review period is periodic.
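A minimal sketch of this periodic-review (s,S) logic is shown below; the parameter values are placeholders and do not correspond to CBC's actual settings.

```python
# Periodic-review (s, S) logic: at each review moment, if the inventory position
# has dropped to or below s, order up to level S; otherwise order nothing.
def s_S_order(inventory_position, s, S):
    if inventory_position <= s:
        return S - inventory_position
    return 0


print(s_S_order(inventory_position=12, s=20, S=80))  # -> 68
print(s_S_order(inventory_position=45, s=20, S=80))  # -> 0
```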

Orders can be placed upstream for every kilogram. It is not possible to order in grams, hence, the order quantity will always be an integer. Therefore, the fixed order quantity will be set to one. Next to that, we also have to take the earlier mentioned Minimum Order Quantity into account here. For that reason, the constraint for lot-sizing is classified as other.

It is also possible that a product type is not in stock at a certain corrugated plant, but is available at another plant. In this case, the product can be delivered from one corrugated plant to the other, which is described as routing flexibility. These connections, however, only exist between a few plants that are located close to each other.

Whenever two or more orders can only be partially fulfilled, an inventory rationing rule has to be applied, such as First-Come-First-Served. CBC uses another method: the remaining stock is divided according to the relative size of the orders. For example, when two orders of 4 kg and 12 kg come in, the stock will be divided 25% and 75%, respectively.
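The sketch below illustrates this proportional rationing rule with the 4 kg / 12 kg example from the text; it is only an illustration of the rule, not code taken from CBC's systems.

```python
# Proportional rationing: divide the remaining stock over open orders in
# proportion to their size (the 4 kg / 12 kg example gives 25% and 75%).
def ration(stock, orders):
    total = sum(orders)
    return [stock * order / total for order in orders]


print(ration(stock=8, orders=[4, 12]))  # -> [2.0, 6.0], i.e., 25% and 75% of the stock
```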


Replenishment rationing needs to be applied when orders are placed at an echelon level. These orders then have to be divided among the stock points (Van Der Heijden, Diks, & De Kok, 1997). In the case of CBC, there is no replenishment rationing because orders are placed at installation level.

2.2.5 Performance specification

CBC wants to optimize its inventory by minimizing the costs, which, in this case, are the holding costs and backorder costs. While minimizing the costs, CBC also wants to achieve an average fill rate of 98% for the customers of its corrugated plants.

The real holding and backorder costs are unknown for the case of CBC. Because we need costs in order to optimize the inventory of a supply chain, we have to estimate these costs. Based on the paper of Kunnumkal and Topaloglu (2011), we define holding costs of $0.6 per kg for the paper mills and $1 per kg for the corrugated plants. If items are backordered at the corrugated plants, we incur costs of $19 per kg. We do not take replenishment costs into account. We use a price per kilogram, because all orders are placed in kilograms, and we do not differentiate between products.
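With these estimated parameters, the per-period costs of a single stock point could be computed as in the following sketch (purely illustrative):

```python
# Per-period cost of a single stock point, using the estimated parameters:
# holding $0.6/kg at paper mills, $1/kg at corrugated plants, backorder $19/kg
# at the corrugated plants. Replenishment costs are not considered.
HOLDING_COST = {"paper_mill": 0.6, "corrugated_plant": 1.0}
BACKORDER_COST = 19.0


def period_cost(inventory_kg, location_type):
    holding = HOLDING_COST[location_type] * max(inventory_kg, 0)
    backorder = BACKORDER_COST * max(-inventory_kg, 0) if location_type == "corrugated_plant" else 0.0
    return holding + backorder


print(period_cost(150, "paper_mill"))        # -> 90.0
print(period_cost(-20, "corrugated_plant"))  # -> 380.0
```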

2.2.6 Scientific aspects

The dimensions Methodology and Research Goal do not involve the current situation of CBC, but are determined by the methods we choose in this research. We first model a simulation to reflect the inventory system of CBC. After that, we minimize the total inventory using a reinforcement learning method, which is an optimization goal. Reinforcement learning is a broad term, which is further explained in Chapter 3. If we were to classify it according to the existing features, we could say that reinforcement learning uses computational experiments that approximate the optimal inventory policies.

2.2.7 Classification string

We now have discussed all dimensions that are relevant for an inventory system. The corresponding classification strings of the network of CBC are stated below. We use these strings to find similar situations in literature and to easily spot the important differences. This classification also makes it easier to see what kind of assumptions we can make when we want to simplify our situation.

T1: 2,G,D,G|F,C|C,B|s,O,R|SC||ACS,O T2: n||M,P,I|P,O,N|N,B

2.3 Current method

As mentioned in the previous section, CBC has a Supply Chain department that handles the inventory. They use an ERP system that is configured with an (s,S) policy, but the planners make the final decision. The planners also decide on specific situations, such as the substitution of products.

At the moment, CBC wants to achieve a 98% fill rate for their customers. In order to realize this, CBC also requires a 98% fill rate from the paper mills to the corrugated plants. ORTEC already showed that it is not necessary to require this fill rate at upstream stock points. Since these stock points do not serve end customers, a high fill rate there does not add direct value for the customers. Next to that, a high service level upstream does not necessarily lead to a significantly better service level downstream.

Unfortunately, the historical inventory levels were not available in the ERP system. Therefore, we have reconstructed these inventory levels by determining the order-up-to level for a single product, for every stock point, in such a way that a 98% fill rate is achieved for every connection. This is in line with the current method of CBC. Please note that these inventory levels are reconstructed using the simulation that is defined in Chapter 6 and, therefore, do not fully reflect the real-life situation of CBC. We measure the fill rate by running the simulation for 50 time periods and replicating this simulation 500 times. Figure 2.2 shows the reconstructed order-up-to levels based on the current policy of CBC. With these order-up-to levels, CBC yields average total costs of 10,467 over 50 time periods.

Figure 2.2: Order-up-to levels for the stock points of CBC (Keulen, Linz, Granada, Umeå, Aken, Erfstadt, Göttingen, Valencia, and Malaga).
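The sketch below illustrates how such a fill rate can be estimated by replication: run the simulation for 50 periods, repeat it 500 times, and divide the total satisfied demand by the total demand. The demand distribution and the single-stock-point logic are placeholder assumptions; the actual simulation is the one defined in Chapter 6.

```python
import random


def simulate_once(order_up_to, periods=50):
    """One hypothetical replication: returns (total demand, satisfied demand) in kg.

    Lead times are ignored and demand is drawn from an assumed uniform
    distribution purely to illustrate the measurement procedure.
    """
    demanded = satisfied = 0.0
    for _ in range(periods):
        inventory = order_up_to          # order up to the target each period
        d = random.uniform(0, 40)
        demanded += d
        satisfied += min(d, inventory)
    return demanded, satisfied


def estimate_fill_rate(order_up_to, replications=500):
    """Average fill rate over all replications: satisfied demand / total demand."""
    results = [simulate_once(order_up_to) for _ in range(replications)]
    return sum(s for _, s in results) / sum(d for d, _ in results)


print(estimate_fill_rate(35))
```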

For this research, it is important to note that the stock points of CBC are production facilities. Because we only focus on the inventory, we only look at the finished paper rolls. Therefore, the scope covers the products after production in the paper mills and before production in the corrugated plants. As a result, the demand for corrugated plants is created by the internal demand of the production schedule. Also, the suppliers of the paper rolls in the paper mills are the production facilities of the mills. Because these production facilities are out of scope, we assume an unlimited production capacity.

CBC is interested in how their inventory can be reduced while achieving the desired fill rate of 98% for their customers. In order to achieve this, the supply chain planners are interested in a fitting inventory policy.

2.4 Conclusion

In this chapter, we provided an analysis of the current inventory system of the CardBoard Company, in order to answer our first set of research questions:

1. What is the current situation at the CardBoard Company?
(a) How can we describe relevant aspects of multi-echelon inventory systems?
(b) What are the specific characteristics of CBC?
(c) What is the current performance for inventory management at CBC?

In Section 2.1, we introduced the classification method of De Kok et al. (2018) and Van Santen (2019). This classification is designed to capture all the relevant aspects of a multi-echelon inventory system. While the classification of De Kok et al. (2018) mainly focused on the relevant aspects that are described in cases in literature, the goal of Van Santen (2019) was to extend this classification so that it reflects the complexity of the real world.

We use the classification of Van Santen (2019) to describe the inventory system of CBC in Section 2.2.


We were able to determine a feature for every dimension in the classification, which resulted in the following classification strings:

T1: 2,G,D,G|F,C|C,B|s,O,R|SC|ACS,O
T2: n||M,P,I|P,O,N|N,B

With this classification, we can now easily compare the situation of CBC with inventory systems in literature. The classification also makes it easier to see what kind of assumptions we can make when we want to simplify our situation.

In Section 2.3, we focused on the current performance of CBC. Although the original inventory levels were not available in the ERP system, we were able to reconstruct these levels using the inventory policy that CBC currently uses. In this policy, CBC aims to achieve a 98% fill rate for their customers, but also wants to achieve this fill rate on the connections between the paper mills and the corrugated plants. In order to reduce the inventory levels, we will introduce a reinforcement learning method that aims to find an improved inventory policy and to minimize the total inventory costs.

In the next chapter, we will do research on other multi-echelon inventory systems and the application of reinforcement learning in these systems.


3. Literature Review

In this chapter, we will discuss the findings of the literature review. We start with an explanation of reinforcement learning and its elements in Section 3.1. Section 3.2 covers the combination of deep learning with reinforcement learning. In Section 3.3, we discuss existing methods for inventory management in a multi-echelon supply chain. We address the current types of reinforcement learning in supply chains and their requirements in Section 3.4. In Section 3.5, we cover the practical usage of reinforcement learning. The conclusion of this literature review is stated in Section 3.6.

3.1 Reinforcement Learning

We briefly introduced reinforcement learning in Section 1.2. In this section, we will further elaborate on the structure of reinforcement learning. Thereafter, we discuss how approximate dynamic programming is related to reinforcement learning, and subsequently, we discuss a widely used method, temporal difference learning, accompanied by the two most used algorithms.

3.1.1 Elements of Reinforcement Learning

Reinforcement learning is about learning what to do to maximize a reward. In reinforcement learning, unlike in most other forms of machine learning, the learner is not explicitly told which action to perform in each state (Sutton & Barto, 2018). At each time step, the agent observes the current state s_t and chooses an action a_t to take. The agent selects the action by trial-and-error (exploration) and based on its knowledge of the environment (exploitation) (Giannoccaro & Pontrandolfo, 2002). After taking the action, the agent receives a reward r_{t+1} and moves to the new state s_{t+1}. Using this information, the agent updates its knowledge of the environment and selects the next action. To maximize the reward, the trade-off between exploration and exploitation is very important. The agent has to explore new state-action pairs in order to find the value of the corresponding reward. Full exploration ensures that all actions are taken in all states. Ideally, with the values that are learned in this phase, a trend can be found, such that it is clear which action is best in each state. The vector that maps every state to the associated optimal action represents the learned optimal policy (Giannoccaro & Pontrandolfo, 2002). Most methods start with full exploration and slowly decrease the chance of exploration, hence increasing the chance of exploitation.
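The interaction described above can be summarized in a short sketch of the agent-environment loop with a decaying exploration probability. The environment interface (reset(), step()) and the decay schedule are illustrative assumptions, not a specific method from this thesis.

```python
import random


def run_episode(env, q_values, actions, epsilon):
    """One episode of the generic agent-environment loop described above.

    `env` is a hypothetical environment with reset() -> state and
    step(action) -> (next_state, reward, done); `q_values[(state, action)]`
    holds the agent's current knowledge of the environment.
    """
    state, done = env.reset(), False
    while not done:
        if random.random() < epsilon:                        # exploration
            action = random.choice(actions)
        else:                                                # exploitation
            action = max(actions, key=lambda a: q_values.get((state, a), 0.0))
        state, reward, done = env.step(action)
        # ... update q_values with the observed reward here ...


# Typical schedule: start with full exploration and slowly decay epsilon
epsilon = 1.0
for episode in range(1000):
    # run_episode(env, q_values, actions, epsilon)   # env etc. are placeholders
    epsilon = max(0.05, epsilon * 0.995)
```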

3.1.2 Markov Decision Process

Every state that the agent enters is a direct consequence of the previous state and the chosen action. Each step and the order in which they are taken hold information about the current state and have an effect on which action the agent should choose next (Zychlinski, 2019). However, with large problems, it becomes impossible to store and process all this information. To tackle this, we assume all states are Markov states. This means that any state solely depends on the state that came directly before it and the chosen action. When a reinforcement learning system satisfies this Markov property, it is called a Markov Decision Process (MDP) (Sutton & Barto, 2018). When we assume that both the set of all possible states and the set of all possible actions in each state are finite, we have a finite MDP. The finite MDP is mostly used in reinforcement learning, because this way we can specify the one-step dynamics. The probability definition of a finite MDP is:

P(s', r | s, a) = Pr{S_t = s', R_t = r | S_{t−1} = s, A_{t−1} = a}    (3.1)

and denotes the probability of any possible next state s', given a state-action pair. Figure 3.1 shows a transition graph, which is a visualization of a Markov Decision Process. This example concerns a robot that can be in two states depending on the battery level: either low or high, which are given in the large circles. The black dots denote the state-action pairs. On each arrow, the transition probability and reward are shown. Note that the transition probabilities labeling the arrows leaving an action node always sum to one (Sutton & Barto, 2018).

Figure 3.1: Transition graph of a finite MDP (Sutton & Barto, 2018).
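A finite MDP such as this one can be written down compactly as a transition model that maps every state-action pair to its possible outcomes. The sketch below uses placeholder probabilities and rewards (not the values of the original example) and checks the property mentioned above: the probabilities leaving each action node sum to one.

```python
# p[(state, action)] -> list of (probability, next_state, reward)
# The numbers are placeholders, not the values from Sutton & Barto's example.
p = {
    ("high", "search"):   [(0.8, "high", 1.0), (0.2, "low", 1.0)],
    ("high", "wait"):     [(1.0, "high", 0.5)],
    ("low", "search"):    [(0.6, "low", 1.0), (0.4, "high", -3.0)],
    ("low", "wait"):      [(1.0, "low", 0.5)],
    ("low", "recharge"):  [(1.0, "high", 0.0)],
}

# The transition probabilities leaving every action node must sum to one
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9, (s, a)
```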

3.1.3 Bellman Equation

When using reinforcement learning, it is important to choose the best action for every state. The Bellman equation is used to learn which action is the best action. The Bellman equation is a general formula that is based on the fact that the value of your starting point is the reward you expect to get from being in that state, plus the value of the state you will enter. The Bellman equation of a deterministic environment is denoted as follows (Bellman, 1957):

V(s) = max_a [R(s, a) + γ V(s')]    (3.2)

In this equation, R(s, a) is the reward of taking a certain action a in state s, and γV(s') denotes the discounted value of the next state. With the Bellman equation, all possible actions in a state are considered and the one with the maximum value is chosen. The Bellman equation is one of the core principles of reinforcement learning and is used, usually in adapted form, in most reinforcement learning methods.
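A common way to use the Bellman equation in practice is value iteration: start from arbitrary values and repeatedly apply the equation until the values stabilize. The sketch below does this for a stochastic transition model in the format of the previous sketch; the deterministic form of Equation 3.2 is the special case with a single outcome per state-action pair. The discount factor and iteration count are arbitrary choices.

```python
def value_iteration(p, states, actions, gamma=0.9, iterations=100):
    """Repeatedly apply the Bellman equation to a finite MDP (sketch).

    `p[(s, a)]` is a list of (probability, next_state, reward) tuples, as in
    the transition-model sketch above. Assumes every state has at least one
    available action in `p`.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {
            s: max(
                sum(prob * (r + gamma * V[s2]) for prob, s2, r in p[(s, a)])
                for a in actions if (s, a) in p
            )
            for s in states
        }
    return V
```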

3.1.4 Approximate Dynamic Programming

Simple MDPs can be solved by evaluating all possible policies. Obviously, this becomes computationally intractable for larger and more complicated MDPs. Another, smarter, way to solve MDPs is Dynamic Programming (DP). Dynamic Programming solves complex MDPs by breaking them into smaller subproblems (Mes & Rivera, 2017). The optimal policy for the MDP is one that provides the optimal solution to all subproblems of the MDP (Bellman, 1957). DP updates the estimates of the values of the states based on the estimates of the values of the successive states, or updates the estimates based on past estimates. Dynamic Programming requires a perfect model of the environment, which must be equivalent to an MDP. For this reason, and because of its high computation costs, DP algorithms are of little use in reinforcement learning. Yet, they represent the theoretical framework of reinforcement learning, because both methods try to achieve the same goal. The advantage of RL is that it can do so with lower computation costs and without assuming a perfect model (Sutton & Barto, 2018).

Approximate Dynamic Programming (ADP) is a modeling framework based on an MDP model. ADP is an umbrella term for a broad spectrum of methods to approximate the optimal solution of an MDP, just like reinforcement learning. It typically combines optimization with simulation. Because Approximate Dynamic Programming and reinforcement learning are two closely related paradigms for solving sequential decision-making problems, we discuss both of them in the section about reinforcement learning in inventory management.

3.1.5 Temporal Difference Learning

With a Markov Decision Process, we have a complete model of the environment, including transition probabilities. However, RL can also be used to obtain an optimal policy when we do not have such a model (Kaelbling et al., 1996). There are two approaches that convert the experience of interacting with the environment into information that can be used for determining an optimal policy. In these approaches, it is not necessary to know the transition probabilities. One approach is to derive a controller directly from experience without creating a model and is called the model-free approach. The other approach, which is called the model-based approach, is to learn a model from experience and derive a controller from that model (Kaelbling et al., 1996).

In reinforcement learning, it is important that the agent knows how to assign temporal credit: an action might have an immediate positive credit, but large negative effects in the future (Sutton, 1984). A way to handle this is to assign rewards to every action at the end of an episode, according to the end result. This method requires a large amount of memory and is therefore not always suitable. In that case, we can use temporal difference methods, which estimate the value of a state-action pair from an immediate reward and the expected reward of the next state (Sutton & Barto, 2018). This way, the estimates are updated more frequently than only at the end of an episode, while future rewards are still taken into account. Temporal difference learning is often described as TD(λ) or TD(0). In the case of the latter, or when λ is set to 0, the algorithm is a one-step algorithm, so the estimate only takes the value of the next state into account. If λ is set to 1, the formulation becomes a more general way of implementing a Monte Carlo approach. Any λ value between 0 and 1 forms a complex n-step return algorithm (Çimen & Kirkbride, 2013). Two temporal difference algorithms that are often used are Q-learning and SARSA, which are covered in the next subsections.

Q-Learning

Q-learning is one of the most used temporal difference algorithms, because it is easy to implement and it seems to be the most effective model-free algorithm for learning delayed rewards (Kaelbling et al., 1996). It is an off-policy algorithm, which means that the Q-learning function learns from actions that are outside the current policy, such as random actions, hence a policy is not needed (Violante, 2019). The Q-values of a state-action pair are based on the Bellman equation and are therefore defined as follows:

Q*(s, a) = R(s, a) + γ max_a Q*(s', a)    (3.3)

Hence, the value of taking action a in state s is defined as the expected reward starting from this state s and taking the action a. Q-learning is a tabular algorithm, meaning that it stores all the information in a table. For every state-action pair, it learns an estimate Q(s, a) of the optimal state-action value Q*(s, a), according to the following value function (Watkins, 1989):

Q(s, a) ← Q(s, a) + α [r + γ max_a Q(s', a) − Q(s, a)]    (3.4)

The γ in Formula 3.4 is the discount factor, which determines the importance of future rewards. A factor of 0 makes the agent myopic by only considering current rewards r. On the other hand, a value of 1 makes it strive for high rewards in the long term.


The α denotes the learning rate and has a value between 0 and 1. A value of 0 means that the Q-values are never updated, while a value of 1 means that only the most recent information is taken into account. It has been shown that, if the learning rate is decayed at an appropriate rate, the estimates converge to the optimal values for all possible pairs (Jaakkola, Jordan, & Singh, 1994; Littman, 2015):

Q(s, a) → Q*(s, a)    (3.5)

Although the exploration-exploitation trade-off must still be addressed, the details of the exploration strategy do not affect the convergence of the learning algorithm. Therefore, Q-learning is exploration insensitive, meaning that the Q-values converge to the optimal values independent of how the agent behaves while the data is being collected (as long as all state-action pairs are tried often enough). This property makes Q-learning very popular (Kaelbling et al., 1996). However, because state-action pairs have to be visited often, Q-learning is also a slow algorithm. Next to that, being a tabular algorithm makes it memory intensive, as all states have to be stored in a table.

The pseudo-code of the Q-learning algorithm can be found in Algorithm 1 (Sutton & Barto, 2018).

Algorithm 1: Q-learning (off-policy TD control) for estimating π ≈ π∗

Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S⁺, a ∈ A(s), arbitrarily, except that Q(terminal, ·) = 0
foreach episode do
    Initialize S
    foreach step of episode do
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α [R + γ max_a Q(S', a) − Q(S, A)]
        S ← S'
    until S is terminal
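For concreteness, the sketch below implements Algorithm 1 in Python for a generic episodic environment. The environment interface (reset() and step()) and the hyperparameter values are illustrative assumptions; this is not the implementation used later in this thesis.

```python
import random
from collections import defaultdict


def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning following Algorithm 1 (sketch).

    `env` is a hypothetical episodic environment with reset() -> state and
    step(action) -> (next_state, reward, done); `actions` is the action set.
    """
    Q = defaultdict(float)  # Q[(state, action)], arbitrarily initialized to 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (exploration vs. exploitation)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy TD update of Equation 3.4 (no bootstrap at terminal states)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```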

SARSA

Another temporal difference algorithm is SARSA. SARSA is an acronym for State-Action-Reward-State-Action (S_t, A_t, R_t, S_{t+1}, A_{t+1}). As opposed to Q-learning, SARSA is an on-policy algorithm, meaning that it updates the policy directly while taking actions. The agent follows a control policy and uses it to update the Q-values, according to the following formula (Sutton & Barto, 2018):

Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]    (3.6)

SARSA also uses a table to store every state-action pair. It estimates q_π according to the behavior policy π and at the same time changes π toward greediness with respect to q_π. In other words, it estimates the return for state-action pairs assuming the current policy continues to be followed, and in the meantime becomes less exploratory and more exploitative. The pseudo-code of the algorithm can be found in Algorithm 2.

Algorithm 2: SARSA (on-policy TD control) for estimating Q ≈ q*

Algorithm parameters: step size α ∈ (0, 1], small ε > 0
Initialize Q(s, a), for all s ∈ S⁺, a ∈ A(s), arbitrarily, except that Q(terminal, ·) = 0
foreach episode do
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    foreach step of episode do
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        Q(S, A) ← Q(S, A) + α [R + γ Q(S', A') − Q(S, A)]
        S ← S'; A ← A'
    until S is terminal

Q-Learning versus SARSA

We can think of SARSA as a more general version of Q-learning; the algorithms are very similar, as we can see in Algorithm 1 and 2. In practice, we can modify a Q-learning implementation to SARSA by changing the update method for the Q-values and by choosing the current action A using the same policy in every episode; inside every step, we then choose A' instead of A, as sketched below.
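The difference boils down to the update rule, as the sketch below shows for a Q-table stored as a dictionary keyed by (state, action); the function names are ours.

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    # Off-policy: bootstrap from the greedy action in s' (Equation 3.4)
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    # On-policy: bootstrap from the action a' that the policy actually chose in s' (Equation 3.6)
    target = r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```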

There can, however, be a significant difference in performance. To illustrate this, we use a famous example from Sutton and Barto (2018): the cliff-walking problem. In this cliff-walking problem, there is a penalty of -1 for each step that the agent takes and a penalty of -100 for falling off the cliff. The optimal path is, therefore, to walk exactly along the edge of the cliff and reach the reward in as few steps as possible. This maximizes the reward, as long as the agent does not fall off the cliff at any point.

Figure 3.2: The cliff-walking problem (Sutton & Barto, 2018).

Figure 3.2 shows two different paths: Q-learning takes the optimal path, while SARSA takes the safe path. The result is that, with an exploration-based policy, there is a risk that the Q-learning agent falls off the cliff whenever an exploratory action is chosen.

SARSA looks ahead to the next action to see what the agent will actually do at the next step and updates the Q-value of its current state-action pair accordingly. For this reason, it learns that the agent might fall into the cliff and that this would lead to a large negative reward, so it lowers the Q-values of those state-action pairs accordingly. The result is that Q-learning assumes that the agent is following the best possible policy without attempting to resolve what that policy actually is, while SARSA takes into account the agent’s actual policy.

When the agent's policy is simply the greedy one, Q-learning and SARSA produce the same results. In practice, most implementations do not use a purely greedy strategy and instead choose something such as ε-greedy, where some of the actions are chosen at random.


In many problems, SARSA performs better than Q-learning, especially when there is a good chance that the agent will choose a random sub-optimal action in the next step, as we can see in the cliff-walking problem. In this case, Q-learning's assumption that the agent follows the optimal policy may be far enough from the truth that SARSA converges faster and with fewer errors (Habib, 2019).

3.2 Deep Reinforcement Learning

In the reinforcement learning methods that we discussed until now, we assumed that we can explore all states and represent everything we have learned about each state in a table. However, when we want to use these methods to solve large, realistic problems, the number of states and actions is too large to store all the information in a table. Next to that, storage is not the only limiting factor, as we also need a large amount of time to estimate each state-action pair. These limitations can be described as a generalization problem. To overcome them, we need a method that can generalize the experience gained when training on a subset of state-action pairs to approximate a larger set. Value function approximation is such a method. With value function approximation, we represent the Q-values by a function instead of a table, for which we only need to store the parameters. This way, it requires less space and it can generalize from visited states to unvisited states, so we do not need to visit every possible state.

There are several ways to use value function approximation for reinforcement learning, but a method that is gaining popularity is the use of deep learning (DL). Deep learning means that the learning method uses one or multiple neural networks to estimate the value function. In a recent paper, Mnih et al. (2015) applied deep learning as a value function approximator and obtained impressive results. Deep learning was combined with reinforcement learning because they share the goal of creating general-purpose artificial intelligence systems that can interact with and learn from their world (Arulkumaran, Deisenroth, Brundage, & Bharath, 2017). While reinforcement learning provides a general-purpose framework for decision-making, deep learning provides the general-purpose framework for representation learning. With the use of minimal domain knowledge and a given objective, it can learn the best representation directly from raw inputs (Sutton & Barto, 2018).

3.2.1 Neural Networks

A neural network consists of multiple layers of nonlinear processing units. This network takes input and transforms it to outputs. An example of a neural network with two hidden layers and one output is illustrated in Figure 3.3.

Figure 3.3: A visualization of a neural network with an input layer of four neurons, two hidden layers of five neurons each, and an output layer with a single neuron.

Every layer of the neural network contains a number of neurons. A neuron is a computational unit with a weight parameter and is visualized in Figure 3.4. A neuron takes input from the previous layer and multiplies it by the weight of the connection. These weighted inputs are added together with a bias. Finally, the sum is passed through an activation function. In the case of the example of Figure 3.4, this results in the following activation function:

y = f(x_1 · w_1 + x_2 · w_2 + b)    (3.7)

The activation function is used to transform an unbounded input into an output with a predictable form and is typically nonlinear. There are multiple kinds of activation functions, such as the sigmoid function. This function has an output in the range of 0 to 1, so it compresses the values. The outcome of this activation function is then passed to the next layer. The output layer is computed the same way as the hidden layers, except that it provides the final value. The error of the output is then calculated using a loss function on the output layer, for which the mean squared error or the log-likelihood are often used (Russell & Norvig, 2009). The lower this loss is, the better the prediction.

While training a neural network, input with a known output is given to the network. With this input, the weights and biases of the network can be adjusted in such a way that they lead to the correct output. This is often done by using stochastic gradient descent, which calculates how much the weight and bias of every connection and neuron have to change. Once a certain loss value or accuracy has been achieved, new data can be given to the network, and the network can then calculate the estimated output of a given input. When combining deep learning with reinforcement learning, the input is the defined state. With this input layer, the network can estimate which action has to be performed. The output layer corresponds to the action space: the network either returns an estimate and standard deviation per action, or returns the probability for all actions. Gained rewards or penalties can be used to update the network. If a penalty is given, the weights are adjusted in such a way that, when the agent ends up in the same state, another action will be chosen.

Figure 3.4: A neuron inside a neural network.
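The sketch below puts Equation 3.7 and one training step together for a single neuron with a sigmoid activation and a squared-error loss. The weights, input, target, and learning rate are placeholder values.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# A single neuron with two inputs, as in Equation 3.7
w = np.array([0.5, -0.3])   # weights (placeholder values)
b = 0.1                     # bias
x = np.array([1.0, 2.0])    # input with a known target output
target = 0.8
lr = 0.1                    # learning rate (hypothetical)

# Forward pass: y = f(x1*w1 + x2*w2 + b)
z = np.dot(x, w) + b
y = sigmoid(z)
loss = (y - target) ** 2            # squared-error loss

# One gradient-descent step on the weights and bias
dloss_dz = 2 * (y - target) * y * (1 - y)   # chain rule through the sigmoid
w -= lr * dloss_dz * x
b -= lr * dloss_dz
```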

3.2.2 Algorithms

Just like in reinforcement learning, several algorithms are developed to work with deep reinforcement learning. In the next subsections, we discuss popular deep reinforcement learning algorithms.

Deep Q-Network (DQN)

Mnih et al. (2015) introduced a new value-function-based DRL method: the Deep Q-Network. In this paper, a single architecture is used to learn to play a wide variety of Atari 2600 games. DQN outperformed competing methods in nearly all games. It is the first reinforcement learning algorithm that could directly use the images of the game as input and could be applied to numerous games. While Q-learning uses a table and stores Q-values for all state-action combinations, DQN uses a neural network to estimate the performance of an action. In DQN, Q-values are calculated for a specific state, instead of for a state-action pair as in Q-learning. This difference is visualized in Figure 3.5. The concept of using states instead of state-action pairs is also often used in ADP (Powell, 2011). In ADP, this is called the post-decision state variable, which captures the state of the system directly after a decision is made, but before any new information has arrived. The advantage of using the post-decision state variable is that it is not necessary to evaluate all possible outcomes for every action, but only the different states (Mes & Rivera, 2017).


Figure 3.5: A visualization of the differences between a Q-table and a Q-network. Adapted from Gemmink (2019).

The value function approximator for DQN is shown in Equation 3.8 (Mnih et al., 2015). Note that this equation shows both the state and action, because we still want to approximate the Q-values using both the state and action.

Q*(s, a) = max_π E[r_t + γ r_{t+1} + γ² r_{t+2} + … | s_t = s, a_t = a, π]    (3.8)

This equation defines the best sum of rewards r_t, r_{t+1}, … that are discounted over time by discount factor γ, achievable by a behavior policy π = P(a|s), after taking action a in state s (Mnih et al., 2015). While other value function approximations in reinforcement learning are known to become unstable, Mnih et al. (2015) solve this instability issue with two new ideas: experience replay and iterative updates of the target network. With experience replay, the agent's experiences of every time step are stored in a data set, and when a Q-learning update iteration is executed, a sample of these experiences is drawn at random. With this method, correlations in the observation sequences are removed. The target network is only updated after a number of iterations. A requirement for experience replay is that the laws of the environment do not change in such a way that past experiences become irrelevant (Lin, 1991). With these ideas, Mnih et al. (2015) are able to successfully parameterize the approximate value function.
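A sketch of these two ideas is given below: a replay memory that is sampled uniformly at random, and a target network that is only synchronized every fixed number of steps. The buffer size, batch size, synchronization interval, and the gradient_update function are placeholders; no specific deep learning library is assumed.

```python
import copy
import random
from collections import deque

replay_memory = deque(maxlen=100_000)   # experience replay buffer (size is a placeholder)
BATCH_SIZE = 32
SYNC_EVERY = 1_000                      # how often the target network is refreshed


def store_transition(state, action, reward, next_state, done):
    """Store one time step of experience in the replay memory."""
    replay_memory.append((state, action, reward, next_state, done))


def dqn_iteration(step, q_net, target_net, gradient_update):
    """One DQN training iteration (sketch).

    `q_net` and `target_net` are whatever network objects are used, and
    `gradient_update(batch, q_net, target_net)` is a hypothetical function
    that performs the actual Q-learning update on the sampled batch.
    """
    if len(replay_memory) >= BATCH_SIZE:
        batch = random.sample(replay_memory, BATCH_SIZE)   # random sampling breaks correlations
        gradient_update(batch, q_net, target_net)
    if step % SYNC_EVERY == 0:
        target_net = copy.deepcopy(q_net)                  # iterative update of the target network
    return target_net
```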

Actor-Critic (AC)

The actor-critic algorithm, applied in a deep reinforcement learning setting by Lillicrap et al. (2016), is a combination of value-function-based methods (such as Q-learning and DQN) and policy-based methods. In actor-critic methods, the actor acts as a policy and updates the policy distribution in the direction suggested by the critic. The critic estimates the value function. The estimation is based on a value function approximator of the returns of all actions of the environment. This algorithm gives smaller gradients, which results in more stable updates with less variance. The architecture of the actor-critic methods is illustrated in Figure 3.6. In Mnih et al. (2016), the actor-critic algorithm is expanded to the Asynchronous Advantage Actor-Critic (A3C) algorithm. It is asynchronous because it executes multiple agents on multiple instances of the environment in parallel, with the same deep neural network. Instead of experience replay, updating is done with the experience of the different actors. This way, it reduces the chance of getting stuck in a local optimum.

Figure 3.6: The Actor-Critic architecture (Sutton & Barto, 2018).

Proximal Policy Optimization (PPO)

Currently, one of the most successful DRL methods is Proximal Policy Optimization (PPO), which is based on the actor-critic architecture and was introduced by Schulman et al. (2017) of OpenAI. Policy search methods usually have convergence problems, which are addressed by the natural policy gradient. The natural policy gradient involves a second-order derivative matrix, which makes the method not scalable for large problems. PPO uses a slightly different approach: it can use gradient descent to optimize the objective. The objective is updated gradually, to ensure that the new policy does not move too far away from the old policy. This makes the algorithm less sensitive to outliers in its observations. Schulman et al. (2017) compare their PPO method with the DQN of Mnih et al. (2015) and outperform it in several cases. PPO is sample efficient and computationally less intensive.
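The core of this gradual update is the clipped surrogate objective, sketched below with NumPy. The clipping value of 0.2 is the default reported by Schulman et al. (2017); the function and argument names are ours.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of PPO (to be maximised), as a sketch.

    `logp_new` / `logp_old` are log-probabilities of the taken actions under
    the current and the old policy; `advantages` are their advantage estimates.
    """
    ratio = np.exp(logp_new - logp_old)                  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))       # pessimistic (clipped) bound
```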

3.3 Multi-echelon Inventory Management

The term multi-echelon is used for supply chains in which an item moves through more than one stage before reaching the final customer (Ganeshan, 1999; Rau, Wu, & Wee, 2003). In other words, a product moves through different locations in the supply chain. Because of the multiple locations, managing inventory in a multi-echelon supply chain is considerably more difficult than managing it in a single-echelon one (Gumus, Guneri, & Ulengin, 2010). Because of its complexity, research on multi-echelon inventory models is still gaining importance. In the last decades, the research has focused more on integrated control of the supply chain (Gumus & Guneri, 2009; Kalchschmidt, Zotteri, & Verganti, 2003; Rau et al., 2003). Literature reviews conducted on multi-echelon inventory management describe several methods to optimize the inventory levels. According to Gumus and Guneri (2007), mathematical models are used most, often in combination with simulation. After that, heuristics and Markov Decision Processes are most used.

The aim of mathematical models is to solve the inventory problem to proven optimality. However, determining optimal policies is intractable, even for basic multi-echelon structures, due to the well-known curses of dimensionality (Bellman, 1957; De Kok et al., 2018). As a result, most mathematical models use extensive restrictions and assumptions, such as deterministic demand and constant lead times. Also, mathematical models are usually complex and very case-specific. We are looking for a method that can be used as a benchmark in order to compare it with the results of our reinforcement learning methods. This method has to be easy to implement and will be used for various cases. For these reasons, mathematical models are not suitable as a benchmark for our reinforcement learning method.

For this research, we want to gain insight into the methods that can be useful for our reinforcement learning model as a benchmark (heuristics) or as a building block (Markov Decision Processes).

3.3.1 Heuristics

Due to the intractability of mathematical models, heuristics are often used in the optimization of multi-echelon inventory management. Heuristics are used to find a near-optimal solution in a short time span and are often a set of general rules, which can, therefore, be used on various problems. Several papers about reinforcement learning in multi-echelon inventory management use heuristics as a benchmark.

Van Roy, Bertsekas, Lee, and Tsitsiklis (1997) use a neuro-dynamic programming solution approach to solve their inventory problem. To benchmark their solution, they use a heuristic to optimize the order-up-to policy (i.e., base-stock policy). This heuristic has been the most commonly adopted approach in inventory optimization research (Nahmias & Smith, 1993, 1994). For multi-echelon inventory systems, the base-stock policy, though not necessarily optimal, has the advantage of being simple to implement and close to the optimum (Rao, Scheller-Wolf, & Tayur, 2000).

Oroojlooyjadid, Nazari, Snyder, and Takáč (2017) also use a heuristic, introduced by Snyder (2018), to choose the base-stock levels. Gijsbrechts et al. (2019) define three scenarios to apply their reinforcement learning method to: dual sourcing, lost sales, and multi-echelon. They use several heuristics as benchmark for the first two scenarios, such as Single Index, Dual-Index, and Capped-Dual-Index. They also use a Linear Programming based Approximate Dynamic Programming algorithm, introduced by Chen and Yang (2019), but this algorithm did not manage to develop policies that outperform these heuristics. For the multi-echelon scenario, they state that it is not possible to deploy the LP-ADP approach, as the structure of the value function is unknown and the K-dimensional demand makes the number of transition probabilities too large and the formulation of the constraints of the LP infeasible. Therefore, they also use a base-stock policy with constant base-stock levels as benchmark.
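For reference, the base-stock (order-up-to) rule itself is very simple, which is part of its appeal as a benchmark. The sketch below shows the ordering decision for a single stock point; the numbers in the example are arbitrary.

```python
def base_stock_order(base_stock_level, inventory_position):
    """Order-up-to rule: order the gap between the base-stock level and the
    current inventory position (never a negative quantity)."""
    return max(0, base_stock_level - inventory_position)


# Example: base-stock level 35, inventory position 22 -> order 13 units
print(base_stock_order(35, 22))
```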

3.3.2 Markov Decision Process

Several papers use a Markov Decision Process (MDP) to model their supply chain. As we discussed in Section 3.1, MDPs are an important aspect of reinforcement learning. Almost all papers that use an MDP to solve their supply chain apply their method to a linear supply chain. In a linear supply chain, every stock point in this supply chain has only one predecessor and only one successor. Lambrecht, Muchstadt, and Luyten (1984) showed that the multi-echelon stochastic problem can be viewed as a Markov Decision Process.

Lambrecht, Luyten, and Vander Eecken (1985) use this finding to study the allocation of safety stocks and safety times of a two-echelon, single-item inventory system. This process heavily depends on the problem characteristics, such as demand variability, cost structure, and lead time structure. Yet, they are able to determine near-optimal policies. Chen and Song (2001), Vercraene and Gayon (2013), and Iida (2001) use an MDP in a linear n-echelon supply chain. The main purpose of Iida (2001) is to show that near-myopic policies are acceptable for a multi-echelon inventory problem. The Markov Decision Process is also used in a divergent supply chain in the papers of Chen, Feng, and Simchi-Levi (2002) and Minner, Diks, and De Kok (2003).

If we were to use an existing MDP as a building block for our reinforcement learning method, it must reflect a similar situation as the case of CBC, which has a General network structure. As discussed in Chapter 2, we consider a network structure to be general if this network contains at least one stock point that can be supplied by more than one upstream location, while also containing at least one stock point that can supply multiple downstream locations. To the best of our knowledge, Markov Decision Processes have not been applied to a General network structure. Therefore, we will resort to the literature of reinforcement learning to use and adapt one of their Markov Decision Processes.

3.4 Reinforcement Learning in Inventory Management

As we are interested in how we can apply reinforcement learning to the situation of the CardBoard Company, we look into the literature to see how reinforcement learning is used in other multi-echelon models. However, reinforcement learning in multi-echelon inventory management is still a broad concept in which various multi-echelon systems and solution methods can be used. A fair amount of research on this subject has already been done. We want to see which papers are close to the case of the CardBoard Company and which solution approaches are most used. This way, we can select the papers that fit closely to our research and, therefore, are most relevant. To gain insight into the different multi-echelon systems that are described in the literature and the solution approaches that are being used, we create a concept matrix. This matrix consists of elements of a multi-echelon inventory system that are relevant for our reinforcement learning method. These elements are extracted from the classification typology of Van Santen (2019). Table 3.1 shows the classification elements and their possible values: the first column states the classification element, while the second column lists the possible values of this classification element.

Table 3.1: Classification elements and values

Classification element | Values
Network structure | Linear, Divergent, Convergent, General
Products | Single, Multiple
Unsatisfied demand | Lost Sales, Backorders, Hybrid
Goal | Minimize Costs, Maximize Profit, Target Service Level
Uncertainties | Demand, Lead time
Horizon | Finite, Infinite
Algorithm | Various
Approach | Reinforcement Learning (RL), Deep Reinforcement Learning (DRL), Neuro-Dynamic Programming (NDP), Approximate Dynamic Programming (ADP)

The first classification element, the network structure, has four different values: Linear, Divergent, Convergent, and General. As mentioned earlier, a network is linear when every stock point in this network has one predecessor and one successor. It is divergent if the stock points have one predecessor but multiple successors, and convergent whenever the stock points have multiple predecessors and only one successor. In a general network, stock points can have both multiple predecessors and multiple successors.

The second element, products, defines the number of products that are taken into account in the solution approach of the paper. We distinguish between focusing on a single product and on multiple products at the same time.

The element unsatisfied demand denotes what happens whenever a stock point does not have enough stock to satisfy all demand. In some cases, this demand is lost, resulting in lost sales. Other cases use backorders, which means that the demand will be satisfied at a later point in time. In hybrid cases, there is a certain chance that the customer wants to backorder, otherwise the demand is lost.

The goal of the paper is the next element and can be to minimize the costs, to maximize the profit, or to reach a target service level. If the goal is to reach a target service level, the method tries to minimize the difference between the realized service level and the target service level, penalizing both positive and negative deviations.

The next element covers the uncertainties that are taken into account in the paper, which can be the demand and the lead time. For these uncertainties, values are generated from a probability distribution.

The next element is the time horizon, which can be either finite or infinite. In a finite horizon, the paper simulates a certain number of days or weeks. The outcome of the method will be a set of actions for every time period that is simulated. In an infinite horizon, the simulation runs for a very long period. The outcome is a set of parameters that can be used for a predefined inventory policy.

The values of the next element, the algorithm, vary. These can be algorithms that are well known in reinforcement learning, but also new algorithms that are only used in a specific paper.

The last element is the approach that is used in the paper. This can be reinforcement learning, deep reinforcement learning, approximate dynamic programming, or neuro-dynamic programming. Neuro-dynamic programming and approximate dynamic programming are both synonyms of reinforcement learning (Bertsekas & Tsitsiklis, 1996).

The concept matrix is filled with papers that use reinforcement learning - or one of the earlier mentioned synonyms - to solve a multi-echelon inventory problem. We found 29 published papers on this subject that are relevant for this thesis. The paper is denoted in the first column of the concept matrix, after which the classification elements are stated. Whenever a paper did not clearly describe a certain feature, the classification element is left blank in the table. The last column of the concept matrix is reserved for additional notes on aspects that do not immediately fit the structure of the concept matrix, but are interesting characteristics of the paper. The complete concept matrix is given in Table 3.2.


Table 3.2: Usage of reinforcement learning in inventory management. (1/2)

Paper | Network structure | Products | Unsatisfied demand | Goal | Uncertainties | Horizon | Algorithm | Approach | Additional Notes
[1] | | Single, Multiple | Backorders | Minimize costs | Demand, Lead time | Finite | SMART | RL | Simulation optimization
[2] | Linear | Single | Backorders | Minimize costs | Demand, Lead time | Finite | Q-Learning | RL |
[3] | General | Multiple | Lost sales | Minimize costs | Demand | Infinite | TD(λ) | ADP | Production system
[4] | General | Multiple | Lost sales | Minimize costs | Demand | Infinite | Q-Learning, TD(λ) | ADP | Production system
[5] | Divergent | Single | Lost sales | Minimize costs | Demand | Finite | Double-pass | ADP | Centralized and decentralized control
[6] | Linear | Single | Backorders | Maximize profit | Demand, Lead time | Infinite | SMART | RL |
[7] | Divergent | Single | Hybrid | Minimize costs | Demand | Infinite | A3C | DRL | Also applied its technique on two other, non multi-echelon problems
[8] | Divergent | Single | Lost sales | Target service level | Demand | Infinite | Case-based | RL | Fixed group of customers and competitive market
[9] | Divergent | Single | Lost sales | Target service level | Demand | Infinite | Action-Value | RL | Non-stationary demand, centralized and decentralized control, adjust safety lead time
[10] | Linear | Single | Backorders | Minimize costs | Demand | Infinite | Asynchronous Action-Reward | RL | Adjust safety lead time
[11] | Linear | Single | | Target service level | Demand | | Distributed supply chain control algorithm | RL | Adjust safety lead time
[12] | Divergent | Single | Backorders | Minimize costs | Demand | Finite | Lagrangian relaxation | ADP |
[13] | Linear | Single | Backorders | Minimize costs | Demand | Infinite | Retrospective action reward | RL |
[14] | Divergent | Single | Lost sales | Target service level | Demand | Infinite | Case based myopic | RL | VMI
[15] | Divergent | Single | Lost sales | Target service level | Demand | Infinite | | RL | Stationary and non-stationary demand
[16] | Linear | Single | Backorders | Minimize costs | Demand | Infinite | DQN | DRL |
[17] | Divergent | Single | Lost sales | Maximize profit | Demand | Infinite | Q-learning | RL | Single agent and non-cooperative multi agent, inventory sharing between retailers
[18] | Divergent | Single | Lost sales | Maximize profit | Demand | Infinite | Q-learning | RL | Single agent and non-cooperative multi agent, inventory sharing between retailers
[19] | Linear | Single | Backorders | Minimize costs | Demand | Infinite | Q-learning, Profit sharing | RL |


Table 3.2: Usage of reinforcement learning in inventory management. (2/2)

Paper | Network structure | Products | Unsatisfied demand | Goal | Uncertainties | Horizon | Algorithm | Approach | Additional Notes
[20] | Divergent | Single | | Minimize costs | Demand | Finite | AC | ADP | Changing distribution
[21] | Divergent | Multiple | | Minimize costs | Demand | Finite | AC | ADP | Dual-sourcing, non-stationary demand
[22] | Divergent | Multiple | Lost sales | Maximize profit | Demand, Lead time | Infinite | Q-learning | RL | VMI
[23] | Divergent | Single | Hybrid | Minimize costs | Demand | Infinite | TD(λ) | NDP |
[24] | Linear | Single | Backorders | Minimize costs | Demand | Finite | Q-learning | RL |
[25] | Convergent | Single | Backorders | Minimize costs | Demand | Infinite | MARVI | ADP |
[26] | Linear | Single | Lost sales | Minimize costs, Target service level | Demand | Infinite | Action-reward | RL | Centralized and decentralized, non-stationary demand
[27] | Linear | Single | | Target service level | Demand | Infinite | Q-learning | RL | VMI, non-stationary demand, centralized and decentralized
[28] | Divergent | Single | Backorders | Target service level | Demand | Infinite | Flexible fuzzy | RL |
[29] | Divergent | Single | Lost sales | Target service level | Demand | Infinite | | RL | Stationary and non-stationary demand
This thesis | General | Multiple | Backorders | Minimize costs, Target service level | Demand | Infinite | | |

With: [1] Carlos, Jairo, and Aldo (2008), [2] Chaharsooghi, Heydari, and Zegordi (2008), [3] Çimen and Kirkbride (2013), [4] Çimen and Kirkbride (2017), [5] Geng, Qiu, and Zhao (2010), [6] Giannoccaro and Pontrandolfo (2002), [7] Gijsbrechts, Boute, Van Mieghem, and Zhang (2019), [8] Jiang and Sheng (2009), [9] Kim, Jun, Baek, Smith, and Kim (2005), [10] Kim, Kwon, and Baek (2008), [11] Kim, Kwon, and Kwak (2010), [12] Kunnumkal and Topaloglu (2011), [13] Kwak, Choi, Kim, and Kwon (2009), [14] Kwon, Kim, Jun, and Lee (2008), [15] Li, Guo, and Zuo (2008), [16] Oroojlooyjadid, Nazari, Snyder, and Takáč (2017), [17] Rao, Ravulapati, and Das (2003), [18] Ravulapati, Rao, and Das (2004), [19] Saitoh and Utani (2013), [20] Shervais and Shannon (2001), [21] Shervais, Shannon, and Lendaris (2003), [22] Sui, Gosavi, and Lin (2010), [23] Van Roy, Bertsekas, Lee, and Tsitsiklis (1997), [24] Van Tongeren, Kaymak, Naso, and Van Asperen (2007), [25] Woerner, Laumanns, Zenklusen, and Fertis (2015), [26] Xu, Zhang, and Liu (2009), [27] Yang and Zhang (2015), [28] Zarandi, Moosavi, and Zarinbal (2013), [29] Zhang, Xu, and Zhang (2013)

We now have an overview of the research on reinforcement learning in multi-echelon inventory management. As we can see, a large number of papers is already available. However, the papers differ substantially in their inventory systems and approaches. When we compare the network structures of the papers, we see that a divergent network is used most, followed by a linear supply chain. Only one paper uses a convergent structure and two papers describe a general network structure, which is the same network structure as CBC. Another interesting finding is that almost every paper only takes a single product into account. In total, five papers define a situation with multiple products. In the case of unsatisfied demand, lost sales and backorders are used almost equally often in the literature, whereas two papers consider a hybrid approach. Reaching a target service level is the goal in eight papers; most of the other papers focus on the costs, by either minimizing the costs or maximizing the profit. Demand uncertainty is used in every paper, while the lead time is uncertain in only three of the papers. The simulated horizon is infinite in most cases, which results in parameters for an optimal policy, which CBC is also interested in. The algorithms that are used vary, but Q-learning is the most used algorithm. Another algorithm that is used multiple times is the actor-critic (AC) algorithm, or an extended version of it, such as A3C.

As we mentioned earlier, reinforcement learning and approximate dynamic programming are very similar and, therefore, we look at both methods. However, it is important to distinguish them from deep reinforcement learning and neuro-dynamic programming. These methods use a neural network and substantially differ in their solution approach. While RL and ADP are both often used in inventory management, NDP and DRL are used only three times.

After reviewing these papers, we can conclude that, while reinforcement learning is used because of its promise to solve large problems, many papers still use extensive assumptions or simplifications. As we can see in the table, the problems that are solved in literature are still quite different from the current situation of CBC.

Contribution to literature

The goal of our research is to develop a reinforcement learning method for the situation of CBC. In the current literature, reinforcement learning has been applied to various cases, but there is still a gap between these cases and the case of CBC. In the literature, the network structures are still simple and small, and only a single product is taken into account in most cases. Next to that, reinforcement learning is praised for its versatility, but only one paper applies its method to different supply chains. Our case has a general network structure and has backorders. In this research, we elaborate on existing reinforcement learning methods and expand them in such a way that we approach the situation of CBC. The contribution of this thesis to the current literature is, therefore, twofold: (1) we apply reinforcement learning to a case that has not been considered before, and (2) because we start by expanding current cases, we create a simulation that can be used to apply reinforcement learning to various supply chain cases.

3.5 Reinforcement Learning in Practice

In the previous section, we reviewed papers about reinforcement learning in inventory management. We created an overview of these papers and structured the research that has been done in this field. However, when reviewing these papers, we concluded that none of the papers has included its exact code. This means that we are not able to easily reproduce these papers and extend their work, but have to start from scratch.

Fortunately, nowadays, blogs are a very popular way of sharing knowledge and often include practical parts, such as which software to use or pieces of code. Therefore, we also included several blog posts in this research. A popular website with high-quality blog posts is towardsdatascience.com. This website contains blogs about all kinds of machine learning and data science methods and is known for its practical solutions. At the moment, towardsdatascience.com has already published over 4,500 blog posts on reinforcement learning, written by Ph.D. students, researchers, developers, and data scientists. In these blog posts, the problems vary from a tic-tac-toe problem to winning a game of FIFA, but what most blog posts have in common is that they start with an already defined environment. These environments are part of the OpenAI Gym package.

OpenAI is a non-profit artificial intelligence research company and was founded to advance digital intelligence in a way that is most likely to benefit humanity as a whole (OpenAI, 2015). They fund interesting AI projects and encourage others to share and generalize their code. Many researchers stated that it was hard for them to reproduce their own experiments, which is why one of the first projects of OpenAI was the creation of OpenAI Gym. OpenAI Gym was created to standardize artificial intelligence projects and aims to create better benchmarks by providing a number of environments that are easy to set up. This way, OpenAI can increase reproducibility in the field of AI and provide tools with which everyone can learn the basics of AI (Rana, 2018). In these gyms, one can experiment with a range of different reinforcement learning algorithms and even develop their own (Hayes, 2019).


In this research, we start with a simple supply chain and expand this model in complexity to approach the situation of CBC; therefore, we will model different cases. For these cases, a common and general environment helps in understanding the process and enhances reusability and reproducibility. An OpenAI Gym would, therefore, fit this research, but, unfortunately, there is no general inventory environment available yet. At the time of writing, there is one inventory-related environment for OpenAI Gym. This gym, called 'gym-inventory', is an environment for a specific single-echelon inventory problem with lost sales from Szepesvári (2010). Because this gym only covers that specific case, we cannot use it for our scenario. Therefore, we have to set up an environment ourselves. We can still use the structure of the OpenAI Gym. The gyms have a fixed structure in order to ensure compatibility with other projects of OpenAI, like the several deep reinforcement learning algorithms they offer. This way, it is possible to implement existing algorithms for the environment and focus on fine-tuning the algorithm instead of creating it from scratch. For this reason, we use the structure of the OpenAI Gym and, therefore, can use their algorithms. The structure of an OpenAI Gym includes, among others, a standard definition of the action and state space and a step function that is executed for every time step.
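The sketch below outlines what such an environment looks like in the classic OpenAI Gym structure: an action space, an observation space, a reset function, and a step function. The class name, the chosen spaces, and their dimensions are placeholders, and the actual supply-chain dynamics are omitted; it is not the environment built later in this thesis.

```python
import gym
import numpy as np
from gym import spaces


class InventoryEnv(gym.Env):
    """Minimal skeleton of an inventory environment in the OpenAI Gym structure."""

    def __init__(self, num_stock_points=4, max_order=30):
        super().__init__()
        # One continuous order quantity per stock point
        self.action_space = spaces.Box(low=0, high=max_order,
                                       shape=(num_stock_points,), dtype=np.float32)
        # For example: inventory level and backorders per stock point
        self.observation_space = spaces.Box(low=-np.inf, high=np.inf,
                                            shape=(2 * num_stock_points,), dtype=np.float32)

    def reset(self):
        # Return the initial observation
        self.state = np.zeros(self.observation_space.shape, dtype=np.float32)
        return self.state

    def step(self, action):
        # ... apply the orders, generate demand, update inventories ...
        reward = 0.0          # e.g. minus the holding and backorder costs
        done = False
        return self.state, reward, done, {}
```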

3.6 Conclusion

In this chapter, we have performed a literature review on reinforcement learning and multi-echelon inventory management. With this literature review, we can now answer our second set of research questions:

2. What type of reinforcement learning is most suitable for the situation of the CardBoard Company?
(a) What types of reinforcement learning are described in literature?
(b) What methods for multi-echelon inventory management are currently being used in literature?
(c) What types of reinforcement learning are currently used in multi-echelon inventory management?
(d) How are the findings of this literature review applicable to the situation of CBC?

In Section 3.1, we have gained more insight into the different types of reinforcement learning. We covered the elements of reinforcement learning methods and discussed several algorithms. Recently, reinforcement learning has been combined with deep learning, which was introduced in Section 3.2. This opened up new research for problems that were too large for previous methods.

Section 3.3 covers the research on multi-echelon inventory management. Multi-echelon inventory management has been researched for a long time. Yet, the problem remains hard to solve because of the curses of dimensionality. Therefore, most papers use extensive restrictions and assumptions, such as deterministic demand or a limited number of stock points. Heuristics are often used to solve the inventory problems to near-optimality, of which the base-stock policy is the most popular. This policy can serve as a benchmark for our reinforcement learning method. Markov Decision Processes are also being used to solve inventory problems; however, none of them are applied to a general network structure.

Because of the restrictions and assumptions used in traditional methods, it is interesting to solve the multi- echelon inventory problem using reinforcement learning, which is discussed in Section 3.4. In literature, a considerable amount of research has been done on reinforcement learning in inventory management. Most of these papers focus on a linear supply chain and use Q-learning to solve this problem. General supply chains are only used in two ADP methods. Deep reinforcement learning is a recent method and therefore has not been applied broadly. At the moment, two papers have used DRL in inventory management. None of the papers included their code, which makes it hard to reproduce their results.


Therefore, we decided to look at practical examples in Section 3.5. Online, various implementations of all the reinforcement learning algorithms can be found. To ensure that our code can work with these algorithms, we will use the OpenAI Gym structure, which is the most commonly used structure for reinforcement learning problems. By combining the algorithms from OpenAI and the knowledge and cases that we have found in the literature, we will work towards a reinforcement learning method for the CardBoard Company. In the next chapter, we introduce our first problem, for which we will implement a reinforcement learning method.


4. A Linear Supply Chain

In this chapter, we introduce a case from one of the papers in Section 4.1. The different elements of a reinforcement learning system are defined for this case, starting with the state variable and the action variable in Section 4.2 and Section 4.3 respectively. Sections 4.4 and 4.5 elaborate on the reward function and the value function. Section 4.6 introduces the Q-learning algorithm for the reinforcement learning method. Section 4.7 covers our implementation of the simulation, and Section 4.8 expands this simulation with the Q-learning algorithm. Section 4.9 implements a deep reinforcement learning method. We end this chapter with a conclusion in Section 4.10.

4.1 Case description

As mentioned in Chapter 1, we will start by building our reinforcement learning method for a toy problem that is defined in literature. This way, we can set up a relatively simple environment and directly evaluate our solution. When this solution is fully working, we can expand the method. As a start, we chose a linear network structure, because of its limited connections and thus limited action space. Next to that, a linear supply chain makes it easier to follow the shipments, which can help in debugging the code. Although serial systems are already extensively covered in literature and can be solved to (near) optimality using mathematical models and heuristics, it is still an interesting case to test our reinforcement learning method on, because we can then easily see how the method performs. To fully reproduce a paper, it should ideally list its complete input and output data. This way, we can follow the method step by step, instead of only focusing on the end result. After reviewing the literature, we found one paper that fulfilled these requirements and explicitly listed the input and output data: the paper of Chaharsooghi et al. (2008). This paper uses the beer game as a case. This game was developed by Jay Wright Forrester at MIT and has a linear supply chain with four levels: retailer, distributor, manufacturer, and supplier. In the MIT beer game, the actor of each level attempts to minimize its own inventory costs, but we will focus on a system that tries to minimize the total supply chain inventory costs. The decision on the order size depends on various factors, such as the supply chain inventory level, environment uncertainties, and downstream ordering. In this model, environmental uncertainties include customer demand and lead times.

Figure 4.1: Supply chain model of the beer game from Chaharsooghi, Heydari, and Zegordi (2008).

According to the supply chain model shown in Figure 4.1, three groups of variables are specified to characterize the supply chain:

Si(t) represents the inventory position of level i in time step t, (i = 1, 2, 3, 4)

Oi,j(t) represents the ordering size of level i to the upstream level j, (i = 0, 1, 2, 3, 4; j = i + 1)

Ti,j(t) represents the distribution amount of level i to the downstream level j, (i = 1, 2, 3, 4, 5; j = i−1)


In this supply chain, customer demand that can not be fulfilled immediately will be backlogged and backorder costs will be incurred. These backorders will be fulfilled in the next time period if the inventory allows it. The inventory position is determined by the current inventory and the orders that will be received from upstream minus the number of backorders.

The demand of the customer is not known before the replenishment orders are placed. Next to that, the lead time is also uncertain. The lead time of every level is uncertain, except the lead time to the customer, which is always zero. For all the levels, one lead time is given per time step. Therefore, the lead time does not differ per connection between stock points, but only differs between time steps. Hence, in each time step, there are two uncertain parameters, which only become known after the stock points have placed their orders.

In the beer game, the goal is to minimize the total inventory costs. There are four agents that decide on an order quantity every time period. The objective is thus to determine the quantity of Oi,j in such a way that the total inventory costs of the supply chain, consisting of inventory holding costs and penalty costs of backorders, are minimized. This results in the following minimization formula, defined by Chaharsooghi et al. (2008):

minimize \sum_{t=1}^{n} \sum_{i=1}^{4} [\alpha h_i(t) + \beta C_i(t)]    (4.1)

Where:

h_i(t) = \begin{cases} S_i(t), & \text{if } S_i(t) > 0 \\ 0, & \text{otherwise} \end{cases}    (4.2)

C_i(t) = \begin{cases} |S_i(t)|, & \text{if } S_i(t) \leq 0 \\ 0, & \text{otherwise} \end{cases}    (4.3)

Every time step, a decision on the order quantity has to be made. This results in the following transition to the next time step:

Si(t + 1) = Si(t) + Oi,j(t) − Ti,j(t) (4.4)

In this minimization problem, hi(t) is defined as the on-hand inventory of level i at time step t, and Ci(t) is defined as the backlog in level i at time step t. The latter is liable for the penalty cost. α is the inventory holding cost of each stock point, per unit per period, which is in the case of the MIT Beer Game $1 per case per week. β is defined as the backorder cost of each stock point per unit per period, which is $2 per case per week. n is the time horizon and is equal to 35 weeks.
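As a concrete illustration of these cost terms, the short sketch below computes h_i(t), C_i(t), and the resulting period cost from a vector of inventory positions. The function name and example values are our own.

    ALPHA = 1  # holding cost per case per week
    BETA = 2   # backorder cost per case per week

    def period_cost(S):
        """S is a list with the inventory position S_i(t) of the four stock points."""
        holding = sum(s for s in S if s > 0)         # sum of h_i(t)
        backorders = sum(-s for s in S if s <= 0)    # sum of C_i(t) = |S_i(t)|
        return ALPHA * holding + BETA * backorders

    # Example: two stock points with stock, one with backorders, and one empty.
    print(period_cost([5, -3, 0, 12]))  # 1 * (5 + 12) + 2 * 3 = 23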

At every level at each time step, four events happen. At first, the previous orders are received from the upstream stock point. Thereafter, the new order is received from the downstream level. The order is then being fulfilled from on-hand inventory if possible and put in backorder otherwise. At last, for every stock point, an order is placed upstream for stock replenishment. For the retailer, the order quantity (demand) is generated from a uniform distribution of [0, 15]. These steps are described later in Section 4.7.

A decision about the order size must be made for the four stock points simultaneously. This order size is then communicated to the upstream stock point. The information about the inventory position of all stock points is therefore represented in the system state in the format of a four-element vector. This state vector is used in the Q-table, together with the possible actions. This way, we have a centrally coordinated supply chain whose goal is to define a global strategy to minimize the total holding and


backorder costs. With this information, we are able to classify the beer game of Chaharsooghi et al. (2008), according to the classification of De Kok et al. (2018) and Van Santen (2019):

T1: 4,S,D,G|I,G|N,B|O,F,N|C|ACS,O
T2: 1||D,P,N|P,F,N|HB,N

In the next sections, we will elaborate on the specific aspects of this reinforcement learning method, including the state variable, reward function, ordering policy, and algorithm.

4.2 State variable

As noted in Section 3.1, if a system state has the Markov property, the state solely depends on the state that came directly before it and the chosen action. This way, it is possible to predict the next state and expected reward given the current state and action. Even if the state signal is not strictly Markov, we can approximate it to the desired accuracy by adding additional state variables. This way, formulating the problem as a reinforcement learning problem is possible. Since the final decision in reinforcement learning models is made according to the system state, it is essential for the system state to provide proper information for the agents' decision-making process. The state vector defined by Chaharsooghi et al. (2008) is therefore:

S(t) = [S1(t),S2(t),S3(t),S4(t)] (4.5)

Where S(t) is the system state vector at time step t and each Si(t) is the inventory position of stock point i at time step t. However, with the current definition of the inventory position, the state space is infinite, which makes determining the near-optimal policy impossible because it would need infinite search power. For this reason, the inventory position is coded by mapping the state vector components to finite numbers. Chaharsooghi et al. (2008) performed a simulation run that showed a coding strategy with 9 sets is proper for their case. This coding strategy can be found in Table 4.1. This way, the infinite state size is mapped to (9^4 =) 6561 states.

Table 4.1: Coding of the system state, extracted from Chaharsooghi, Heydari, and Zegordi (2008)

Actual Si:  [−∞; −6)  [−6; −3)  [−3; 0)  [0; 3)  [3; 6)  [6; 10)  [10; 15)  [15; 20)  [20; ∞)
Coded Si:       1         2         3        4       5        6         7         8         9
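A small sketch of this coding step in code; the bin edges follow Table 4.1, and the function name is our own.

    def code_inventory_position(s):
        """Map an actual inventory position to the coded value 1..9 of Table 4.1."""
        upper_bounds = [-6, -3, 0, 3, 6, 10, 15, 20]   # right-open interval edges
        for code, bound in enumerate(upper_bounds, start=1):
            if s < bound:
                return code
        return 9   # the interval [20, infinity)

    print([code_inventory_position(s) for s in [12, -4, 0, 25]])   # [7, 2, 4, 9]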

4.3 Action variable

Every time step, the order quantity has to be decided for every stock point. When ordering, the current inventory position of all the stock points is shared, but the demand and lead times are still unknown. In this case, the X+Y ordering policy is used. According to this policy, if the agent had X units of demand from the downstream stock point in the previous period, it orders X+Y units to the upstream stock point.

This represents the earlier mentioned Oi,j, hence Oi,j(t) = Xi,j(t) + Yi,j(t). Y can be zero, negative, or positive. Hence, the agent can order an amount equal to, greater than, or less than its received order. The value of Y is determined by the common learning mechanism. The action vector in period t is defined as follows:

Yt = [Y1,t,Y2,t,Y3,t,Y4,t] (4.6)

Where Yit denotes the value of Y in the X+Y ordering policy for stock point i in the time step t. In the reinforcement learning method, the optimal policy is determined according to the learned value of Q(s, a). Thus, for each state s the action with the greatest Q(s, a) value is selected. After estimating the Q(s, a) values for all combinations of states and actions, a greedy search on the Q-values can converge


the algorithm to the near-optimal policy. Hence, the policy of this reinforcement learning method can be shown as:

YS = [Y1,S,Y2,S,Y3,S,Y4,S] (4.7)

Where YS is the policy in the state S. Yi,S is again the value of Y in the X+Y ordering policy for stock point i, but now it depends on the system state S. The optimal agent policies are determined according to the Q-values, which have to be estimated. Therefore, Chaharsooghi et al. (2008) proposed an algorithm to solve this problem. This algorithm solves the reinforcement learning model of a supply chain ordering problem and is explained in Section 4.6. The simulation performed by Chaharsooghi et al. (2008) showed that the range [0,3] is suitable for Y, which results in (4^4 =) 256 actions.
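As an illustration of this greedy lookup, the sketch below maps each of the 4^4 Q-table columns back to a Y-vector and returns the one with the highest Q-value. The index mapping and names are our own illustrative choices.

    import numpy as np
    from itertools import product

    # All 4^4 = 256 possible action vectors [Y1, Y2, Y3, Y4] with Y in {0, 1, 2, 3}.
    ACTIONS = list(product(range(4), repeat=4))

    def greedy_action(q_table, state_index):
        """Return the Y-vector with the highest Q-value for the given (coded) state."""
        best_column = int(np.argmax(q_table[state_index]))
        return ACTIONS[best_column]

    # Toy table with one row per coded state of a single time step.
    q_table = np.zeros((9 ** 4, len(ACTIONS)), dtype=np.float32)
    print(greedy_action(q_table, state_index=0))   # (0, 0, 0, 0) while the table is all zeros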

4.4 Reward function

The objective of this reinforcement learning method is to minimize the total inventory costs in the supply chain. Hence, the reward function is defined as:

r(t) = \sum_{i=1}^{4} [h_i(t) + 2C_i(t)]    (4.8)

Where r(t) is the reward function at time step t and is a function of the on-hand inventory and the number of backorders of every stock point at time step t. Since the MIT beer game is applied to this method, the inventory holding cost of each stock point per unit per period and the backorder cost of each stock point per unit per period are set to 1 and 2, respectively.

4.5 Value function

The value function is used to find the policy that minimizes the total costs. In this value function, the next system states are embedded. Because these states are influenced by uncertain customer demand and lead times, calculating the value function exactly is hard. Therefore, the value function will be estimated, as discussed in Section 3.1. In the paper of Chaharsooghi et al. (2008), the value function is written in the form denoted by Sutton and Barto (2018), being:

V (s) = E[Rt|st = s] (4.9)

Where Rt is defined as follows:

R_t = \sum_{k=0}^{34-t} \gamma^k \sum_{i=1}^{4} [h_i(t+k+1) + 2C_i(t+k+1)]    (4.10)

However, for clarity and consistency, we will write the value function in the form of the Bellman equation (Equation 3.2). Therefore, our value function of the beer game is denoted as follows:
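Written out, this Bellman form would read approximately as follows (our reconstruction from Equation 3.2, using the reward of Equation 4.8):

    V(s) = \mathbb{E}\left[ r(t+1) + \gamma \, V(s(t+1)) \mid s(t) = s \right]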

Where r(t) is the reward function of Equation 4.8. In this model, discounting is not considered and the gamma is therefore equal to 1. Estimating the value function will be done using Q-values for each state-action pair. The best action for each state will be specified and the optimal policy can be derived. In this model, Q-values are estimated by Q-learning, which will be explained in the next section.

4.6 Q-learning algorithm

In the model of Chaharsooghi et al. (2008), Q-values are estimated by an algorithm based on Q-learning. The Q-values are stored in a large table, with one dimension defined by the number of states and the other by the number of actions. The algorithm can be found in Algorithm 3 and will be further explained in this section.


Algorithm 3: Algorithm based on Q-Learning (Chaharsooghi, Heydari, & Zegordi, 2008).

1  set the initial learning conditions, including:
2      iteration = 0, t = 0, Q(s,a) = 0 for all s, a;
3  while iteration ≤ MAX-ITERATION do
4      set the initial supply chain conditions, including:
5          Si, i = 1, 2, 3, 4: the initial/starting inventories for the MIT Beer Game;
6      while t < n do
7          with probability Pr_exploitation(iteration, t), select the action vector with maximum Q(s,a); otherwise take a random action from the action space;
8          calculate r(t + 1);
9          if the next state is s' then
10             update Q(s,a) using Q(s,a) = Q(s,a) + α[−r + max_{a'} Q(s',a') − Q(s,a)];
11         do action a and update the current state vector;
12         increase Pr_exploitation(iteration, t) according to the scheme;
13         t = t + 1;
14     t = 0;
15     iteration = iteration + 1;
16 by a greedy search on Q(s,a), the best action in each state is distinguished;

In the algorithm, we first initialize the learning conditions and create a Q-table. In the beginning, the agents do not know the value of each action in each state, thus the initial value is set to zero for all state-action pairs. Every iteration, we initialize our supply chain of the beer game.

Figure 4.2: Exploitation rate per period

We then simulate our beer game for n periods, which is 35 weeks in this case. Every week, we select an action that has to be taken, in other words, the Y in our X+Y ordering policy. This action is either selected at random or by retrieving the highest Q-value for the current state, based on the exploitation rate. The probability of exploitation is a function of the iteration number and is increased linearly with every iteration. Also, within each iteration, it is increased linearly from period 1 to period n. In the model of Chaharsooghi et al. (2008), the exploitation rate starts at 2% and is increased to 90% in the last iteration. Within each iteration, it is increased from the start probability that is determined by the iteration number to 98% in the nth period. Figure 4.2 shows an example of the exploitation rate with 5 iterations and 35 periods in every iteration.

When the action is taken, the simulation runs for one time step and executes the four previously mentioned events. A new state is returned, along with its reward. This reward is calculated using Equation 4.8 and contains the holding and backorder costs. In the algorithm, the value of the Q-functions is learned in every time step, hence the Q-function is estimated for every state that the agent gets to. The value is learned according to the general Q-learning formula. However, because the aim of the beer game is to minimize the inventory costs of the whole supply chain, the reward function must be minimized, which results in a negative r in the formula that is given in line 10 of Algorithm 3. This equation updates the state-action pair values in the Q-table. α is the learning rate and must be defined between 0 and 1. The


learning rate controls how much weight is given to the reward just experienced, as opposed to the old Q-estimate. A very small α can slow down the convergence of the algorithm, while a very large α (close to 1) can intensify the effect of a biased sample and also hamper convergence. The simulation of Chaharsooghi et al. (2008) showed that a learning rate of 0.17 is suitable for the current case. By running the reinforcement learning method long enough, every state-action pair can be visited multiple times and the Q-values converge. After finishing the learning phase, the best action in each state is retrievable by a greedy search on each state through the Q-table. Because the beer game uses a finite horizon, the state variable is dependent on the time step, as is also denoted in Equation 4.2. Therefore, Q(s, a) is sometimes also denoted as Q(s_t, a) in finite-horizon cases, while Q(s', a') is then denoted as Q(s_{t+1}, a'). As this time step is already included in the state variable itself, we decided not to use the latter notation, for notational brevity.
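As an illustration of the update in line 10 of Algorithm 3, a minimal sketch with our own variable names is given below; the learning rate of 0.17 is the value reported in the paper.

    import numpy as np

    ALPHA = 0.17   # learning rate reported by Chaharsooghi et al. (2008)

    def q_update(q_table, s, a, cost, s_next):
        """One Q-learning update; costs enter as negative rewards (minimization)."""
        target = -cost + q_table[s_next].max()     # r + max_a' Q(s', a') with r = -cost
        q_table[s, a] += ALPHA * (target - q_table[s, a])

    q = np.zeros((10, 4))                  # toy table: 10 states, 4 actions
    q_update(q, s=0, a=2, cost=23, s_next=1)
    print(q[0, 2])                         # 0.17 * (-23 - 0) = -3.91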

4.7 Modeling the simulation

We now have extracted all the necessary information from the paper of Chaharsooghi et al. (2008) to replicate their reinforcement learning method. Because reinforcement learning always consists of both a simulation and an optimization part, we first want to reproduce the simulation and validate this part. As mentioned earlier, Chaharsooghi et al. (2008) included both the input and output data in their paper. The policy, in terms of the Y values, that they obtained using Q-learning, is therefore also stated. We can use this policy as fixed input for our model, without having to model the Q-learning algorithm. This way, we can easily compare our results with the results of the paper. We will elaborate on the events for every time step and validate the model we built. When the simulation model is validated, we can implement the Q-learning algorithm, and, therefore, complete the optimization part. As mentioned before, the simulation executes four events every time step. Although these events were not further specified by Chaharsooghi et al. (2008), we were able to reconstruct what happens in every event. This was done by looking at the output that was given in the paper and by collecting information about the beer game.

At first, the previous shipments are received from the upstream stock point. The pseudo-code of this action can be seen in Algorithm 4. In this event, we check for every stock point whether there is a shipment arriving for it in the current time step. We do not have to loop over the customers, as their shipments will be delivered immediately in event 3.

When the shipment is added to the inventory, demand is generated and we can receive new orders from the downstream stock point, according to Algorithm 5. We first have to check for all stock points and suppliers whether there is enough inventory to fulfill the order from the downstream stock point. Whenever this is the case, the order can be fulfilled completely. Else, we fulfill what we can from the on-hand inventory (if any) and add the remaining amount to the backlog. Thereafter, we check if there are any backorders that can be fulfilled, using the same logic.

The fulfillment of the orders can now happen, according to event 3, given in Algorithm 6. This event is activated by event 2 and receives information about the source, destination, and quantity of the order.

At last, for every stock point, a decision has to be made about the order for its replenishment. As mentioned, this case uses an X+Y ordering policy, in which X is the incoming order, and Y is the safety factor. This Y can be either negative, positive, or zero. Based on the exploitation rate, Y will either be the best action according to the Q-value or a random action. The pseudo-code of this event is given in Algorithm 7.


Algorithm 4: Event 1. Previous shipments are received from the upstream stock point

1 for i = 0 to all stock points do
2     for j = 0 to all stock points and supplier do
3         add the shipment of current time step t from j to i to the inventory of i
4         remove the shipment from the in_transit table

Algorithm 5: Event 2. Orders are received from downstream stock point

1  generate the random demand for the customer from the uniform interval [0, 15]
2  for i = 0 to all stock points and customer do
3      for j = 0 to supplier and stock points do
4          if inventory of j ≥ order from i to j then
5              fulfill the complete order                    /* See Algorithm 6 */
6          else
7              fulfill the on-hand inventory if possible     /* See Algorithm 6 */
8              add remaining part of the order to backlog
9  for i = 0 to all stock points and customer do
10     for j = 0 to supplier and stock points do
11         if there are backorders for j and inventory of j > 0 then
12             if inventory is larger than backorders then
13                 fulfill the entire inventory              /* See Algorithm 6 */
14                 empty the backlog
15             else
16                 fulfill on-hand inventory                 /* See Algorithm 6 */
17                 remove fulfilled amount from backlog

Algorithm 6: Event 3. Fulfill the order from inventory

Data: source : integer, destination : integer, quantity : integer
1 remove the quantity from the source inventory
2 if destination is a customer then
3     add quantity to the customer's inventory
4 else
5     draw a random lead time from the uniform interval [0, 4]
6     ship quantity to arrive at current time step t + lead time

Algorithm 7: Event 4. Place an order upstream

Data: Y : list of integers
1 for i = 0 to all stock points do
2     add all the incoming orders for stock point i to Xi
3     place an order upstream with amount Xi + Yi

We now have reconstructed all the events that are taken in the simulation and are able to implement these events ourselves. We modeled our simulation in Python, because of its popularity in the machine learning domain and its ease of use. The code of this simulation can be found in Appendix A.

Custom settings that have to be taken into account, according to Chaharsooghi et al. (2008), are that the starting inventory of all stock points is 12 units. Hence, according to the coding of the system state as


denoted in Table 4.1, our starting state variable is S(0) = [7, 7, 7, 7]. The stock points also have 8 units in the pipeline, of which 4 will be delivered in time step 0 and 4 in time step 1. With this information, we can validate our simulation model. As input, we use the customer demand, lead times, and policy denoted by Chaharsooghi et al. (2008). The policy is given in the paper as output data, along with the inventory position of every stock point per time step and its corresponding costs for all 35 weeks.

To see what is going on in our simulation model and to ease the debugging, we also added a visualization. This visualization is illustrated in Figure 4.3. For every action in every time step, this visualization is generated. This way, we can easily walk through all the steps and see if our simulation matches the output from Chaharsooghi et al. (2008). In the visualization, every stock point is denoted as a separate node, while the numbers above and under the connections reflect the number of orders, the amount of backlog, the amount in transit and arriving orders. For example, the left 7 in Figure 4.3 represents the number of products that node 2 will receive from node 1 in this time period. An additional benefit of this visualization is that it will be easier to explain the code to others.


Where 1: Producer, 2: Supplier, 3: Manufacturer, 4: Distributor, 5: Retailer, 6: Customer, 7: The number of products that the downstream actor will receive this time period, 8: The total number of prod- ucts that are in transit, 9: The order size that is placed upstream, 10: The amount of backorders at the upstream actor.

Figure 4.3: Visualization of the Beer Game simulation.

To validate the model we built, we have to compare the generated inventory position to the inventory position that is given in the paper. As we eliminated all randomness in the model, because we used the input data from the paper, the inventory positions should match exactly. However, when comparing the results, we notice two differences.

The first thing we noticed was that the inventory position stated in the output of Chaharsooghi et al. (2008) does not take the incoming orders in the pipeline into account, in contrast to their definition of the inventory position. Hence, the inventory positions are calculated solely by the number of items in stock minus the backlogged orders. The second difference occurs when the backlogged orders are fulfilled whenever the inventory is larger than the number of backorders, as denoted in line 15 of Algorithm 5. When we follow the output of Chaharsooghi et al. (2008), we see that in this case, the orders are removed from the backlog, but never arrive as inventory downstream. When we adjust our model to not fulfill the inventory and only empty the backlog, and to not take the incoming orders into account in the inventory position, the inventory positions of the created model exactly correspond to the inventory positions from the paper. We can now say that we successfully validated our implementation of the model against the model that is described in the paper.

While the first difference in the inventory positions could be a design choice, the second difference only happens in a specific case, for which we cannot think of a logical explanation. We, therefore, think that this is an error in the model of Chaharsooghi et al. (2008) and was not intended. While we think that our implementation of the model is correct, we want to experiment with the above two settings in order to see their implications. This results in three different experiments, on which we will run the Q-learning algorithm.


4.8 Implementing Q-learning

After successfully validating the model, we can now implement the Q-learning method. This Q-learning algorithm is stated in Algorithm 3. As documented in the algorithm, we first initialize all parameters and the Q-table. In this case, the Q-table has a size of 9^4 · 35 (the number of states multiplied by the total number of time steps) by 4^4 (the number of actions). This results in a large table with 60,853,275 cells in total. According to the algorithm, we execute a predetermined number of iterations.

Almost all parameters needed for the Q-learning algorithm are mentioned in the paper, such as the learning rate and the discount rate. However, the only parameter that is missing is the total number of iterations that Chaharsooghi et al. (2008) performed. For our implementation, we therefore have to choose an appropriate number of iterations ourselves. Because we want to run several experiments, it is important to take the running time into account. Running the reinforcement learning method for 1 million iterations takes about 2 hours. As stated before, the Q-table has more than 60 million different state-action combinations that are all possible to visit. Hence, visiting all the state-action pairs is not possible due to the lack of time. Therefore, we decided to run the experiments using 1 million iterations. We also ran the method for a longer period of 10 million iterations, but we did not see a significant improvement. Our implementation of this Q-learning based algorithm can be found in Appendix B.

Chaharsooghi et al. (2008) performed experiments using four different datasets. Although the complete output was only given for the first dataset, which we used for validating our simulation, the input data and the results for all four datasets are listed. Table 4.2 lists the four different datasets that are used as input for the experiments. As we can see, datasets 1 and 2 both use the same lead times as input, just like datasets 3 and 4. For the customer demand, datasets 1 and 3 share the same input.

Table 4.2: Four test problems, extracted from Chaharsooghi, Heydari, and Zegordi (2008)

Dataset  Variable  Data (35 weeks)
Set 1  Customer demand: [15,10,8,14,9,3,13,2,13,11,3,4,6,11,15,12,15,4,12,3,13,10,15,15,3,11,1,13,10,10,0,0,8,0,14]
       Lead-times:      [2,0,2,4,4,4,0,2,4,1,1,0,0,1,1,0,1,1,2,1,1,1,4,2,2,1,4,3,4,1,4,0,3,3,4]
Set 2  Customer demand: [5,14,14,13,2,9,5,9,14,14,12,7,5,1,13,3,12,4,0,15,11,10,6,0,6,6,5,11,8,4,4,12,13,8,12]
       Lead-times:      [2,0,2,4,4,4,0,2,4,1,1,0,0,1,1,0,1,1,2,1,1,1,4,2,2,1,4,3,4,1,4,0,3,3,4]
Set 3  Customer demand: [15,10,8,14,9,3,13,2,13,11,3,4,6,11,15,12,15,4,12,3,13,10,15,15,3,11,1,13,10,10,0,0,8,0,14]
       Lead-times:      [4,2,2,0,2,2,1,1,3,0,0,3,3,3,4,1,1,1,3,0,4,2,3,4,1,3,3,3,0,3,4,3,3,0,3]
Set 4  Customer demand: [13,13,12,10,14,13,13,10,2,12,11,9,11,3,7,6,12,12,3,10,3,9,4,15,12,7,15,5,1,15,11,9,14,0,4]
       Lead-times:      [4,2,2,0,2,2,1,1,3,0,0,3,3,3,4,1,1,1,3,0,4,2,3,4,1,3,3,3,0,3,4,3,3,0,3]

We run our reinforcement learning method for 1 million iterations. To measure the performance of the Q-learning algorithm over time, we simulate the policy that the method has learned so far, for every 1000 iterations. For these simulations, we use the datasets from Table 4.2. This way, we can see the improvement of our Q-learning implementation and can compare our results with the paper. To provide representative results, we repeat the whole procedure 10 times. The experiments are run on a machine with an Intel(R) Xeon(R) Processor CPU @ 2.20GHz and 4GB of RAM. The results of the experiments can be found in Figure 4.4.

As we can see in Figure 4.4, we run the experiments using three different experimental settings. In these settings, we varied the fulfillment of backorders according to line 15 in Algorithm 5 and the definition of the inventory position, where we either do or do not take the incoming shipments into account. Experiment 1 represents the settings with which we validated our simulation model against the one from Chaharsooghi et al. (2008). Hence, it does not fulfill the shipments according to line 15 in Algorithm 5 and does not take the incoming shipments into account in the inventory position. Experiment 2 does fulfill the shipments but does not take the incoming shipments into account. Experiment 3 does fulfill the shipments and does take the incoming shipments into account in the inventory position. Although we know for sure, through the validation of our simulation model, that the settings of Experiments 2 and 3 are not used by Chaharsooghi et al. (2008), and they can therefore not be used to compare these results to the RLOM, we still use these settings, because we want to know if the Q-learning method is able to learn better with them. The result from the paper, where the method is called the Reinforcement Learning Ordering Mechanism (RLOM), is also visualized in this figure to easily compare the results. Further on, we will compare the results in more detail. Because we only have the end results from the paper, we visualize them using a horizontal line. Please note that the figure visualizes the costs and, therefore, uses positive numbers.

Figure 4.4: Costs of the Q-learning algorithm per dataset (Experiments 1-3 and the RLOM, for Sets 1-4; costs against iterations ×10,000). Lower is better.

We can immediately see that Experiment 1 approaches the result of the RLOM in three of the four datasets. However, for dataset 1, the RLOM yields a significantly better result than our method. We can see an improvement over the first few iterations for our method, but after these few iterations, no significant improvement is gained.

We also performed Experiments 2 and 3 to see if these experiments were able to learn better. Although these experiments yield better results on datasets 3 and 4, they perform worse on datasets 1 and 2. We can therefore conclude that the method is not able to learn better with these settings and that the improvement on datasets 3 and 4 is most probably due to the fulfilled shipments. As mentioned earlier, we will not compare these experiments with the RLOM, because they use different settings and, therefore, cannot be compared.

Although approaching the results in three of the four datasets is a good result, we are still curious to


see why we are not able to match the result for dataset 1. To gain insights into the origin of the rewards of the different experiments, we visualize the costs per time step. For each dataset and experiment, we use the final policy to visualize the costs that are made over time. The costs over time per dataset can be found in Figure 4.5. In this figure, holding costs are denoted as positive integers, while backorder costs are denoted as negative integers, to simplify the visualization. For dataset 1, we have included the holding and backorder costs of the paper. This data was not available for the other datasets, and is therefore not included in the other graphs. We can see that the holding and backorder costs of the paper match the costs of Experiment 1 most of the time, which indicates that our initialization and method are almost certainly implemented the same way as in the paper of Chaharsooghi et al. (2008). We can also see that Experiments 2 and 3 often have more inventory than Experiment 1. While this results in larger holding costs for datasets 1 and 2, it decreases their backorder costs in datasets 3 and 4. The reason for the higher inventory of Experiments 2 and 3 can be found in the fact that, as opposed to Experiment 1, the shipments are fulfilled according to line 15 of Algorithm 5, as mentioned earlier. This results in 'lost' shipments for Experiment 1, because the backorder is reduced, but the items are never added to the inventory. In Experiments 2 and 3, these shipments are added to the inventory, which results in the higher inventory overall.

Figure 4.5: Costs over time per dataset (holding and backorder costs per time step for Experiments 1-3 and, for Set 1, the RLOM). Closer to zero is better.

In the paper of Chaharsooghi et al. (2008), two different methods are used as a benchmark. The first method is the 1-1 ordering policy, where each stock point orders to the upstream stock point whatever is ordered from downstream. Hence, it is the same as the X+Y ordering policy, with the Y being zero every time. The second method is an ordering policy based on a Genetic Algorithm (GA) and is introduced by Kimbrough, Wu, and Zhong (2002). These two methods are used to compare with the results of the RLOM, introduced by Chaharsooghi et al. (2008). In Figure 4.6, we visualize the costs that are given in Chaharsooghi et al. (2008) and compare them with the end results of our method. We can see that the results of the RLOM also differ greatly across the various datasets. When we compare the RLOM with


our Experiment 1, we can see that we are able to approach the results of the RLOM, except for dataset 1.

Figure 4.6: Comparison of the results from the paper (1-1 Policy, GA Based, and RLOM) and our Experiment 1, per dataset. Lower is better.

We can say that our implementation of the reinforcement learning method is able to approach the results of the RLOM of Chaharsooghi et al. (2008). However, this is not yet a confirmation that the Q-learning algorithm is indeed able to learn the correct values. To check this, we take a look at the Q-values. A Q-value represents the expected reward of the current state and its following states, until the end of the horizon, corresponding to the value function. Hence, when the algorithm is successfully able to learn these rewards, the highest Q-value of the first state in the horizon should approximately be the same as the total rewards in the horizon. The course of this highest Q-value of the first state per iteration is visualized in Figure 4.7.

Figure 4.7: Highest Q-value of the first state per iteration.

The Q-value is a negative number, because the rewards are costs that we want to minimize, which is also stated in Algorithm 3. As can be seen, the Q-value differs from the total rewards given in Figure 4.6 and does not come close to them. The Q-values are initialized with zeros and decrease fast in the first iterations, but later on follow a linear pattern. We can explain this linear pattern by the fact that the Q-values are initialized with zeros and the current number of iterations is not enough to visit all state-action pairs. This way, the maximum value of the future state-action pair, denoted by max_{a'} Q(s', a'), is often zero. To prove this, and to ensure that our implementation of the Q-learning algorithm is correct, we experimented with a smaller state-action space and noticed that the highest Q-value of the first state did indeed correspond to the final reward. The results of this experiment can be found in Appendix C. Because the Q-values are initialized with


zero, the Q-learning algorithm assumes that a state-action pair it has not visited yet, is always better than the state-action pairs it has already visited. This makes the algorithm, especially in combination with the current exploitation rate, exploration-oriented. In a case with only a few state-action pairs, this is not really a problem, because all the states can be visited multiple times. In our case, it could be that the results shown in Figure 4.6 are based on exploration, instead of exploitation of the known values. Therefore, we propose three different adjustments to the current Q-learning algorithm: initialized values, a harmonic stepsize and an epsilon-greedy policy.

Initialized values

In our case, we want to minimize our costs, while Q-learning algorithms try to maximize the Q-values by default. Therefore, we take the costs into account as a negative number, so we can indeed maximize the Q-values. The Q-values are often initialized with a value that can easily be improved by the algorithm. For example, if we had positive rewards that we wanted to maximize, it would make sense to initialize the values with zero, because the Q-learning algorithm could easily improve on that. As mentioned earlier, our Q-values are also initialized with zero, which is, in our case, a high value that is impossible to reach, because it would mean we have a time period without any (future) costs. This way, unvisited state-action pairs always seem like the best choice for the algorithm to visit, which makes it exploration-oriented. In our case, it would therefore make sense to initialize the Q-values with a negative value that can easily be improved. We chose to set this value at -8000. This value can be achieved using a simple 1-1 policy, as can be seen in Figure 4.6, and can therefore easily be improved. To speed up the learning process, we also implement a linear increase in the initialized Q-values over the time horizon. This way, the Q-values of the state-action pairs of the first time step are initialized at -8000, with a linear increase towards the state-action pairs of the last time step. These are initialized at -350, which is in line with the 1-1 policy.
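A small sketch of this initialization is given below, assuming a NumPy Q-table with one block of rows (the coded states) per time step; the layout and names are our own illustrative choices.

    import numpy as np

    N_CODED_STATES = 9 ** 4    # coded inventory states per time step
    N_ACTIONS = 4 ** 4         # Y-vectors in the X+Y ordering policy
    HORIZON = 35

    # Linearly increasing initial values: -8000 in week 1 up to -350 in week 35.
    initial_values = np.linspace(-8000.0, -350.0, HORIZON)

    q_table = np.empty((HORIZON * N_CODED_STATES, N_ACTIONS), dtype=np.float32)
    for t in range(HORIZON):
        q_table[t * N_CODED_STATES:(t + 1) * N_CODED_STATES, :] = initial_values[t]

    print(q_table.shape)   # (229635, 256), roughly 235 MB as float32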

Harmonic stepsize

The stepsize is used in the calculation of the Q-value and is the weight that is given to a new observation relative to the current Q-value. This stepsize is denoted by α in Algorithm 3 and has a value of 0.17 in the paper of Chaharsooghi et al. (2008). Although there is no optimal stepsize rule, Powell (2011) suggests a range of simple stepsize formulas that yield good results for many practical problems. We apply one of these rules, a generalized harmonic stepsize (Mes & Rivera, 2017; Powell, 2011):

α_n = max{ a / (a + n − 1), α_0 }    (4.11)

Where n is the current iteration, a can be any positive real number, and α_0 = 0.05. The advantage of the harmonic stepsize is that a larger weight is given to the rewards in the first iterations, and therefore a smaller weight to the existing Q-values. This makes sense, because these Q-values are the initialized values, instead of actual observed rewards. Increasing a slows the rate at which the stepsize drops to zero. Choosing the best value for a therefore requires understanding the rate of convergence of the application. It is advised to choose a such that the stepsize is less than 0.05 as the algorithm approaches convergence (Powell, 2011). Following the formula of Mes and Rivera (2017), we define a minimum α of 0.05. Because we run our method for 1 million iterations, we set a to 5000.
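A sketch of this stepsize rule with the values we use (a = 5000, α_0 = 0.05):

    def harmonic_stepsize(n, a=5000, alpha_min=0.05):
        """Generalized harmonic stepsize of Equation 4.11."""
        return max(a / (a + n - 1), alpha_min)

    print(harmonic_stepsize(1))            # 1.0 in the first iteration
    print(harmonic_stepsize(100_000))      # raw value ~0.048, so the minimum of 0.05 applies
    print(harmonic_stepsize(1_000_000))    # 0.05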

Epsilon-greedy policy

With the new initialized values, our Q-learning algorithm is exploitation-oriented. In combination with the exploitation rate as defined in the paper of Chaharsooghi et al. (2008), our method is too focused on exploitation and rarely explores. Because we want to encourage our method to explore unvisited states, we also implement a new exploitation-exploration logic. The balance between exploration and


exploitation is not trivial, but an often used method is the epsilon-greedy policy. In this policy, we set a fixed epsilon, where 0 < ε < 1. Every time step, we choose a randomly selected action with probability epsilon, or go with the greedy choice otherwise (Szepesvári, 2010). With the epsilon-greedy policy, we maintain a certain degree of forced exploration, while the exploitation steps focus attention on the states that appear to be the most valuable (Powell, 2011). We set the epsilon to a value of 0.05.
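A minimal sketch of this selection step with ε = 0.05 (function and variable names are ours):

    import random
    import numpy as np

    EPSILON = 0.05

    def epsilon_greedy(q_row, epsilon=EPSILON):
        """Pick a random action with probability epsilon, otherwise the greedy one."""
        n_actions = len(q_row)
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return int(np.argmax(q_row))

    action_index = epsilon_greedy(np.zeros(256))   # index into the 256 Y-vectors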

With these three adjustments to the algorithm, we can run the experiments again and evaluate the results. We also included Experiments 2 and 3 again, because we were curious to see if the new method was able to improve for these settings. Figure 4.8a shows the results of the Q-learning algorithm per dataset and Figure 4.8b shows the comparison of the new Q-learning algorithm with the old algorithm.

Figure 4.8: Results of the revised Q-learning algorithm. (a) Results of the revised Q-learning algorithm per dataset (old and new Experiments 1-3 and the RLOM, costs against iterations ×10,000). Lower is better. (b) Costs using the Q-learning algorithm and the revised Q-learning algorithm. Lower is better.

Looking at the results of the different experiments in Figure 4.8, we can see that the results of Experiment 1 did not improve with the new Q-learning algorithm; despite the alterations, it performs slightly worse. Experiments 2 and 3 were able to improve their results, which confirms that the new method is indeed an improvement, but that the method is simply not able to learn correctly with the settings of Experiment 1. Because we expected the algorithm to perform better with the alterations, we dive further into the Q-values.

Figure 4.9 shows the highest Q-value of the first state per iteration. As can be seen, the Q-values of the new Q-learning algorithm do approach the average results. The highest Q-value of the first state in the last iteration, for Experiment 1, is -4250, while the rewards of this experiment vary around -4000. However, when we take a closer look at all the different Q-values of the first state, which can be found in Appendix D, we notice that the values do not differ much. This means that, according to the algorithm, there is no obvious right or wrong action to choose, which results in an almost random policy.

Figure 4.9: Highest Q-value of the first state per iteration.

We also noticed that these Q-values, and therefore also the best action, still vary over time, resulting in an almost random choice of actions. To investigate whether this is indeed the case, we performed a simulation where we selected a random action for every time period. We tested the results on the different datasets and used 100 replications. These results can be found in Figure 4.10. As we can see, the random actions yield a result that is almost equal to the results of the paper, except for dataset 1. We can, therefore, conclude that, although the paper promises to learn the optimal state-action pairs, it does not even manage to beat a method using random actions. We suspect that the results for dataset 1 come from a specific run that only manages to yield good results by coincidence.

However, this does also mean that our revised Q-learning algorithm did not succeed in learning the best actions for this inventory problem. We can think of three main reasons why the algorithm did not succeed, being the small action space, the coded state space, and the large randomness of the environment. In the environment, we have a random demand and a random lead time, which both have a large impact on the size and arrival time of the order. When ordering upstream, the order can arrive 0 to 4 weeks later, which is a large difference, especially considering the horizon of 35 weeks. The state space does not take the incoming orders into account, and needs to code the inventory and therefore generalizes certain inventory positions. This results in leaving out important information in the state vector. Concerning the action space, we defined the order policy to be O = X + Y. Although we can vary Y from 0 to 3 for every node, this only gives us a limited impact on the final order size, as the X is defined as the downstream demand of the previous period, which can be 0 to 24.

Figure 4.10: Costs using the RLOM and random actions, per dataset. Lower is better.


It would be interesting to look at a bigger action space, where we define the complete order size (O). To do this, we have to increase the size of the Q-table. However, we noticed that the Q-table used in this problem is already immense and almost reaches the memory limit of our computer. Expanding the state or action space is not possible, because the Q-table cannot be instantiated in that case. Therefore, we can conclude that the current Q-learning algorithm, because of its use of the Q-table, can only be used in cases where the inventory positions are coded and a small action space is used.

Conclusion

In this section, we implemented Q-learning. We started with implementing the algorithm of Chaharsooghi et al. (2008). We tested this algorithm on four predefined datasets and defined three different experiment settings, to determine the setting that corresponded to the simulation of Chaharsooghi et al. (2008). The first experiment setting was based on the output data of Chaharsooghi et al. (2008) and did not fulfill the shipments according to line 15 in Algorithm 5. Experiment 2 did fulfill these shipments, just like Experiment 3. Experiment 3 did also take the incoming shipments into account in the state vector. Based on the results, we were able to confirm that Experiment 1 was indeed used by Chaharsooghi et al. (2008). However, because this experiment results in ‘lost shipments’, we will not use this setting in the next cases, but will use the setting of Experiment 2 instead.

With the algorithm of Chaharsooghi et al. (2008), we did not get the results we expected, as there was no improvement for most of the datasets. Because we noticed there was room for improvement in the algorithm, we decided to implement these improvements, being: initialized values, a harmonic stepsize, and the epsilon-greedy policy. Despite these improvements, the results did not improve on all datasets. Therefore, we decided to run the simulation using random actions and saw that these results corresponded to the results of Chaharsooghi et al. (2008) for most of the datasets. This way, we concluded that the paper of Chaharsooghi et al. (2008) was not able to learn a good policy, but had such a small action space that even random actions yielded a result that was better than their benchmarks. We also concluded that the Q-learning algorithm with a Q-table is not scalable and, therefore, cannot be used in future cases, as these cases are more complex. Therefore, we need to look for another method that can be used to solve this problem. Hence, in the next section, we will investigate if we can apply Deep Reinforcement Learning to this problem, which promises to be more scalable.

4.9 Implementing Deep Reinforcement Learning

In Chapter 3, we have elaborated on several reinforcement learning methods. We learned that for cases where we cannot explore all state-action pairs, value function approximation can be used. There are several ways to use value function approximation in reinforcement learning, but the most popular one is the use of deep learning. We have explained several deep reinforcement learning methods and found that Proximal Policy Optimization was one of the most recent and most successful methods. Because of its success in recent applications, we will implement this method for the beer game.

We have set up our simulation of the beer game using the OpenAI Gym structure. As mentioned in Section 3.5, this structure makes it possible to use the algorithms of OpenAI and other packages that are also based on the OpenAI structure. For our implementation of the PPO algorithm, we make use of the Python deep learning libraries 'Spinning Up' (Achiam, 2018), 'Stable Baselines' (Raffin et al., 2019), and 'PyTorch' (Paszke et al., 2019). PyTorch is used to build the neural network. Spinning Up and Stable Baselines are libraries that are used to easily implement deep reinforcement learning algorithms. Because we want to be able to make changes to the PPO algorithm, we will not directly use these libraries, but their code will serve as a foundation for our PPO algorithm. When implementing our adaptation of


the PPO algorithm, we use the values of the hyperparameters from the work of Schulman et al. (2017). When implementing a neural network, there are a few design choices that have to be made. In the next subsections, we will elaborate on the choices we made.

State space

Because we are no longer limited by the maximum size of the Q-table, we can redefine our state-vector. The definition of a state is crucial to the learning of the agent. Because we want the agent to learn the expected values of a state the best way possible, it would be logical to pass all the information that is available to the network. This way, the state variables capture enough data about the current state such that the problem has the Markov property, meaning that the prediction of the future rewards and states can be made using only the current state and action choices. Hence, all the information that was previously stored inside the environment can now be exposed to the input layer of the network. For every variable in our state vector, a separate input node is created in our neural network. Our new state vector is defined as follows:

S = [ti, tb, hi,Ci, di, tri,t, oi,t] (4.12)

As can be seen, a lot of information is added to the state vector. ti defines the total inventory in the system, tb defines the total number of backorders in the system. hi and Ci define the inventory and backorders per stock point respectively. di is the demand for the previous period per stock point. tri,t is the total number of items that currently are in transit per stock point per time step, while oi,t is the number of items that are arriving in time step t for stock point i. We have decided to include the time step for these last two variables for a horizon of 4, because of the maximum lead time, which is also 4.

Hence, for oi,1 it takes 1 time step for these items to arrive, for oi,4 it takes 4 time steps for the items to arrive.

The values of these state variables are normalized to a range of [−1, 1], to keep the values close together. A neural network is initialized with small values and updated with small gradient steps. If the input values are not normalized, it takes a long time to move the weights of the neural network to correct values. We want the interval of the state variables before normalizing to have a wide range, to ensure we do not limit our method in its observations. The lower bound is 0 for every variable, because we do not have negative orders, backorders, or shipments. The upper bound of the values before normalizing can be found in Table 4.3; a sketch of this normalization is given below the table.

Table 4.3: Upper bound of the state variables for the beer game.

Variable     Upper bound
ti, tb       4000
hi, Ci       1000
tri,t        150
di, oi,t     30
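A sketch of this normalization step, using the upper bounds of Table 4.3 (the function and dictionary names are ours):

    import numpy as np

    UPPER_BOUNDS = {"total_inventory": 4000, "total_backorders": 4000,
                    "inventory": 1000, "backorders": 1000,
                    "in_transit": 150, "demand": 30, "arriving": 30}

    def normalize(value, upper_bound):
        """Map a value in [0, upper_bound] linearly to [-1, 1]."""
        return 2.0 * np.clip(value, 0, upper_bound) / upper_bound - 1.0

    print(normalize(12, UPPER_BOUNDS["demand"]))   # -0.2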

Action space

Reinforcement learning problems may have either a discrete or a continuous action space. For every output node, the neural network of the actor outputs either the probability that this action will be executed (discrete actions) or the action-value (continuous actions). With a discrete action space, the agent decides which


distinct action to perform from a finite action set. For every distinct action in this action set, the neural network has a separate output node. In the beer game case of Chaharsooghi et al. (2008), this would match the number of actions in our Q-table, being 4^4 = 256 nodes. With a continuous action space, the number of output nodes of the neural network is equal to the length of the action vector. In the beer game case, this would be 4, corresponding to the number of stock points for which we want to take an action. The output of the neural network is a mean and standard deviation for every node. These values correspond to the amount that has to be ordered for every stock point. The PPO algorithm implementation of Spinning Up works with both discrete and continuous action spaces. As mentioned in the previous section, we want to use deep reinforcement learning to estimate the complete order size, instead of using the X + Y order policy. This way, the number of distinct actions can become very large. For example, when we want the order size to be in the interval of [0, 30], we have 31^4 = 923,521 different actions. Therefore, a continuous action space is recommended. This action space results in a more scalable neural network, as the number of output nodes only grows with the number of stock points, not the number of actions.

The default output of an untrained network is 0, with a standard deviation of 1. It could, therefore, take a long time to reach the desired mean output. For example, if the optimal order size is 20, the neural network has to adjust its output to 20, but only does this in very small steps. Hence, it is recommended to scale the action space to lie in an interval of [−1, 1] (Raffin et al., 2019). This should speed up the learning of the network. For the beer game, we decided to use an action space with the interval of [0, 30]. As the average demand is 10, this upper bound should not limit the method. We ran some tests and concluded that it is indeed better to rescale the action to the interval of [−1, 1]. Hence, when the neural network outputs 0.4, the environment rescales the action back to its original interval and will therefore order 21 units. We also experimented with scaling the actions to an interval of [0, 5] and, although this gave slightly better results for most of the runs, there were also runs where the method was not able to learn at all. Overall, the interval of [−1, 1] yields the best results. Because the output of the neural network is unbounded, we clip the outputs that lie outside the given interval. This way, we encourage the network to stay within the limits of the interval. We have also experimented with giving the network a penalty whenever the outputs were outside the given interval, but this yielded worse results.
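A small sketch of this rescaling and clipping step, assuming the order interval [0, 30] (names are ours):

    import numpy as np

    LOW, HIGH = 0.0, 30.0   # order quantity interval used for the beer game

    def rescale_action(network_output):
        """Clip the unbounded network output to [-1, 1] and map it linearly to [LOW, HIGH]."""
        clipped = np.clip(network_output, -1.0, 1.0)
        return LOW + (clipped + 1.0) * 0.5 * (HIGH - LOW)

    print(rescale_action(np.array([0.4, -1.0, 2.3, 0.0])))   # [21.  0. 30. 15.]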

Horizon

Next to that, an important aspect is the horizon of our deep reinforcement learning implementation. The beer game case from Chaharsooghi et al. (2008) uses a finite horizon. The method is executed for 35 weeks and for every week, an action is determined. This way, an action that is performed near the end of the horizon does not always affect the situation, because the result of this action can lie after the horizon. Often, a discount factor is used, such that immediate actions weigh heavier than actions in the distant future. In the beer game, no discount factor is used; hence, future costs are weighted just as heavily as present costs. In the Q-learning method, for every time step in the horizon, a separate table was defined. In deep reinforcement learning, the current time step would be represented by one input node, as it is one variable in the state vector. In PPO, there are two parameters that are important when setting the horizon: the buffer size and the batch size. The buffer size is the length of an episode and is often also denoted as the horizon. The batch size is the size used to split the buffer when updating the weights of the model. For this finite horizon, we let the buffer length and the batch size have the same length as the horizon defined by Chaharsooghi et al. (2008). However, the experiments we conducted with this finite horizon of 35 periods showed that the method was not able to learn properly. Therefore, we have decided to run the method on an infinite horizon. We have removed the time variable from our state vector and included a

51 4.9. Implementing Deep Reinforcement Learning

discount factor of 0.99. We still evaluate the performance on the finite horizon of 35 periods. The length of our buffer is set to 256, the batch size is set to 64.
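The snippet below is a self-contained sketch (not the thesis code) of these two parameters in this setting: rewards collected in a buffer of 256 steps are discounted with γ = 0.99, after which the buffer is split into minibatches of 64 for the weight updates. The reward values are random stand-ins.

```python
import numpy as np

GAMMA = 0.99
BUFFER_SIZE, BATCH_SIZE = 256, 64

def discounted_returns(rewards: np.ndarray, last_value: float = 0.0) -> np.ndarray:
    """Discount the rewards in one buffer, bootstrapping the infinite horizon with last_value."""
    returns = np.zeros_like(rewards)
    running = last_value
    for t in reversed(range(len(rewards))):
        running = rewards[t] + GAMMA * running
        returns[t] = running
    return returns

rewards = -np.random.uniform(0, 10, size=BUFFER_SIZE)   # stand-in: negated period costs
returns = discounted_returns(rewards)
batches = [slice(i, i + BATCH_SIZE) for i in range(0, BUFFER_SIZE, BATCH_SIZE)]  # 4 minibatches
```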

Rewards

Just like Q-learning, PPO tries to maximize the rewards. Because we want to minimize our costs, we once again multiply the costs by −1 to define the rewards. As mentioned in Section 3.2.2, the PPO algorithm uses an Actor-Critic architecture. While the Actor outputs the values of the actions, the Critic estimates the value function. To estimate this value function, the neural network performs a regression on observed state-action pairs and future values. The network weights are then optimized to achieve minimal loss between the network outputs and the training data outputs. However, these future values can be very large, especially in the beginning, when the network is not yet trained. If the scale of these target values differs significantly from that of our input features, the neural network is forced to learn unbalanced distributions of weight and bias values, which can result in a network that is not able to learn correctly. To combat this, it is recommended to scale the output values (Muccino, 2019). In our experiments, we noticed that the scaling of these rewards is not trivial and no best practices are defined yet. In our case, dividing the rewards by 1000 yielded good results and we saw that the expected value of the Critic corresponded to the average reward that was gained in the experiments.
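A sketch of this reward construction (the divisor of 1000 is the value that worked well in our experiments; the function itself is an illustration of ours):

```python
REWARD_SCALE = 1000.0  # divisor that worked well in our experiments

def costs_to_reward(holding_costs: float, backorder_costs: float) -> float:
    """Turn the period costs into a scaled reward: maximizing reward == minimizing costs."""
    return -(holding_costs + backorder_costs) / REWARD_SCALE

print(costs_to_reward(350.0, 1900.0))  # -2.25
```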

Neural Network

In Section 3.2.1, we explained the concept of neural networks. To use a neural network in our experiments, we have to define the number of hidden layers, the number of neurons in these hidden layers, and the activation function. The number of neurons in the input layer is given by the state size, while the number of neurons in the output layer is equal to the number of stock points. In the literature, several rules of thumb can be found to determine the size of the hidden layers. However, most of these are not trivial or are only based on specific algorithms or experiments. Therefore, we have decided to stick to the default neural network size of the PPO algorithm as defined by Schulman et al. (2017), which has two hidden layers, each with 64 nodes. We have also experimented with expanding the neural network to four hidden layers or to 128 nodes per layer, but this gave worse results, mostly because the outputs of the neural network were varying more over time.
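In Spinning Up, this network size is passed through the `ac_kwargs` argument; the call below is a hedged sketch of how such a run could be configured (Pendulum-v0 is only a stand-in for our inventory environment, which also has a continuous action space).

```python
import gym
from spinup import ppo_pytorch as ppo

# Pendulum-v0 stands in for the inventory environment; both use a continuous action space.
ppo(
    env_fn=lambda: gym.make("Pendulum-v0"),
    ac_kwargs=dict(hidden_sizes=(64, 64)),  # two hidden layers with 64 nodes each
    gamma=0.99,
)
```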

Experiments and results

Now that we have defined all the relevant settings for the PPO algorithm, we can run our experiments. The algorithm is set to train 10.000 iterations. Every 100 iterations, we perform the simulation, with the horizon of 35 time steps, to measure the current performance of the algorithm, using the learned policy. Because we do not use the method of the paper of Chaharsooghi et al. (2008) anymore, we decided not to evaluate the performance on the predefined datasets of the paper. Instead, we evaluate using realizations of the same distributions as the ones we use in the learning phase. The simulations are repeated 100 times, of which the average is used. As mentioned earlier, we will no longer use the order policy X + Y , but will decide on the full order size O. We have performed both Experiments 1 and 2, whose definitions remain unchanged. Hence, Experiment 1 is, once again, the method of the paper, with the lost shipments. Experiment 2 is our adaption to the simulation and does fulfill these shipments. Experiment 3 was created to include the incoming orders in the state vector, but because we have completely redefined our state vector for the deep reinforcement learning method, this experiment has become obsolete. The PPO algorithm takes about 1 hour to complete a run. In total, 10 runs are performed. The results of the experiments can be found in Figure 4.11.


As a benchmark, we have also included the average results of the RLOM. Although we now know that these results were only gained using random actions, they can still serve as a reference for what the costs should be. Because of this, we can now also calculate a benchmark for Experiment 2. This is done by performing 500 replications of a simulation that uses the X + Y order policy with random actions, with the settings corresponding to Experiment 2. The average result is used as the benchmark.


Figure 4.11: Costs of the PPO algorithm on the beer game. Lower is better.

Another benchmark that is often used in linear supply chains is the heuristic of Shang and Song (2003). This heuristic can be used to determine the (near-)optimal base-stock levels for every stock point in the linear supply chain. However, because the heuristic assumes a constant lead time, we were not able to use this benchmark.

Due to the large action space, the first iterations yield poor results. However, the method improves quickly and only needs a few iterations to get close to the results of the RLOM. Because we now decide on the full order size, it is quite impressive that the DRL method is able to approach and improve on the RLOM within a few iterations. Experiment 1 is not able to significantly improve on the RLOM; its results vary around those of the RLOM, which could mean that lower costs are hard to achieve in this setting. Experiment 2, however, is able to beat the X + Y ordering policy with random actions and shows less variation over the iterations, which indicates that the method is able to learn better in this setting.

Conclusion

In this section, we have successfully implemented deep reinforcement learning using the Proximal Policy Optimization algorithm. To get this algorithm working, we needed to define several parameters. We experimented with these parameters to determine suitable values, which led to the following findings:

• Values of the environment that concern the neural network, such as the actions and states, should be normalized. We chose an interval of [−1, 1].
• The state space should include all information that is known in the environment. By exposing this data to the neural network, we make sure that the network has the best opportunity to learn.


• If the neural network has to decide on a lot of actions, it is desirable to use a continuous action space. This action space results in a more scalable neural network.
• As defined in the PPO algorithm, the neural network is trained in batches. We yielded the best results using the default parameters of the PPO algorithm, with a buffer size of 256 and a batch size of 64.
• We now train on an infinite horizon. Hence, the time variable is removed from the state vector and we apply a discount rate of 0.99.
• It is recommended to also scale the rewards of the environment. This is, however, not trivial and no best practices are defined yet. To make our rewards smaller, we divided the rewards by 1000 before passing them to the neural network. This yielded better results in our experiments.
• Varying the number of neurons per layer of the neural network did not significantly impact our results. However, a two-layer network yielded better and more steady results than a four-layer network.

We can conclude that the PPO algorithm succeeded in yielding good rewards while using a large action and state space. Because of the large action space, random actions would not yield better results than the benchmark; we can therefore be confident that the method learns the correct actions. These results suggest that deep reinforcement learning is a promising method for other cases. Instead of trying to improve the results even further by tuning the hyperparameters, such as the buffer length and state space, we move on to the next case, which is explained in the next chapter.

4.10 Conclusion

In this chapter, we have introduced a linear supply chain from the paper of Chaharsooghi et al. (2008). We have followed this paper to build a reinforcement learning method to minimize the total holding and backorder costs in the supply chain, in order to answer our sub-question 3a:

3. How can we build a reinforcement learning method to optimize the inventory management at CBC? (a) How can we build a reinforcement learning method for a clearly defined problem from literature?

In Sections 4.1 to 4.6, we have introduced the beer game. We covered the relevant variables that are needed for the Q-learning algorithm, the algorithm that was used by Chaharsooghi et al. (2008). With this information, we were able to reconstruct this paper ourselves. In Section 4.7, we successfully validated our simulation model with the model of Chaharsooghi et al. (2008), using their input and output data.

Thereafter, in Section 4.8, we implemented the Q-learning algorithm from the paper and were able to approach the results of this paper. However, when we took a closer look at the Q-values, we concluded that the Q-learning algorithm was not learning. Therefore, we suggested some improvements to this algorithm. Yet, due to the limited action space and high stochasticity, the new Q-learning algorithm did not improve the results. We discovered that the results of the paper were mostly gained using random actions. This was possible because Chaharsooghi et al. (2008) used an X+Y order policy, where the Q-learning algorithm had to decide on the Y in the interval [0, 3], while the X value of this policy lies in the interval [0, 24]. This way, the impact of the Y values, and, therefore, the impact of the Q-learning algorithm, was limited. Next to that, the Q-learning method already reached its limits with respect to the state and action sizes. Therefore, we decided to implement a new method that is able to use a larger action and state space: deep reinforcement learning.

Deep reinforcement learning uses a neural network to estimate the value function and is therefore a more flexible and scalable method. We implemented deep reinforcement learning using the PPO algorithm in


Section 4.9. This algorithm is one of the most recent algorithms and is praised for its wide applicability and good results. We have implemented the algorithm using an adaptation of the packages ‘Spinning Up’ and ‘Stable Baselines’. We were able to successfully redefine our state, to let it include all relevant information. We performed several experiments in order to define the best values for the parameters, which are noted in the corresponding sections. When performing the final experiments, we saw that the method is successful in gaining good rewards and learning the correct actions. Hence, we will no longer look at the Q-learning algorithm, but only focus on deep reinforcement learning for the next cases.

In this chapter, we also answered sub-question 4a:

4. What are the insights that we can obtain from our model? (a) How should the performance be evaluated for the first toy problem?

We concluded that we are able to perform simulations while the method is still learning, to gain insights into the performance of our method. In this simulation, we calculate the holding and backorder costs, according to our reward function. With a fully learned policy, we are able to run the simulation again and look into the costs per time step. This way, we can see where the costs come from and easily compare them with a benchmark. For the next chapter, we will look into other ways to gain more insights into the method.

We have now successfully applied deep reinforcement learning to the beer game. Looking back at the classification that was introduced in Chapter 2, we defined the beer game as follows: T1: 4,S,D,G|I,G|N,B|O,F,N|C|ACS,O T2: 1||D,P,N|P,F,N|HB,N

In this classification, the elements that differ from the case of CBC are given in bold. As can be seen, there are still various differences between the two cases. In the next chapter, we will try to close this gap by implementing the reinforcement learning method on another case from literature.

In the paper of Chaharsooghi et al. (2008), it is mentioned that future research in reinforcement learning should address the issue of having a non-linear network. As we will work towards the situation of CBC, which has a network of type general, we will first look into a less complicated non-linear network: a divergent network.


5. A Divergent Supply Chain

In this chapter, we will introduce a divergent supply chain to test our reinforcement learning method. We will first introduce the supply chain and its characteristics in Section 5.1. Section 5.2 covers the state variable of this supply chain, followed by the action variable in Section 5.3. The reward function and corresponding value function are given in Section 5.4 and Section 5.5 respectively. In Section 5.6, we describe how we adapted our current simulation to be able to implement the divergent supply chain. To benchmark the results of this divergent inventory system, we implemented a heuristic in Section 5.7. Section 5.8 covers the implementation and results of our deep reinforcement learning method. The conclusion can be found in Section 5.9.

5.1 Case description

In this section, we will discuss how we can apply our deep reinforcement learning method to a divergent supply chain. We choose a divergent supply chain because it is not as complex as a general supply chain, but expands on the functionality of a linear one. With a divergent supply chain, our simulation has to be generalized and expanded. For example, orders can now be placed at a connected stock point, instead of only at the one actor directly upstream. In Section 5.6, we will further elaborate on the changes we had to make to our simulation.

When we look at our concept matrix in Table 3.2, we can see that almost half of the papers consider a divergent network structure. Ideally, we are looking for a paper that lists the entire input and output data, just like Chaharsooghi et al. (2008) did. With this data, we could easily compare our method with the paper and adjust our method where necessary. Unfortunately, none of the listed papers included their input or output data, nor a detailed list of events that happen in every time step. The paper that is most suited for our research is that of Kunnumkal and Topaloglu (2011). As mentioned, this paper does not state its input or output data, but it does clearly define the specific settings of its divergent inventory system. It also mentions the order of the activities that happen in every time period, but no detailed code of these events is given. Because this paper also uses a different solution method (relaxation using Lagrange multipliers) than we do, we decided to only use its list of events and the parameters of the supply chain. Hence, we will define our own state and action variables, and our own reward and value functions. Because of this, we also need to look for a benchmark to compare our results with. We will elaborate on the benchmark in Section 5.7.

Kunnumkal and Topaloglu (2011) consider an inventory distribution system with one warehouse, denoted as φ, and three retailers (r = 1, 2, 3), and a planning horizon of 50 time periods (t = 1, ..., 50). The retailers face random demand and are supplied by the warehouse. The demand at retailer r in time period t is given by the random variable $d_{r,t}$. The demand at the warehouse is defined as $d_{\phi,t} = \sum_{r} d_{r,t}$, so that we can also speak of the demand at the warehouse in time period t. The warehouse replenishes its stock from an external supplier. All actors in this supply chain together are denoted as (i = 1, 2, 3, 4, 5, 6, 7, 8). A visualization of this supply chain can be seen in Figure 5.1.

Figure 5.1: Divergent supply chain. Where 1: Supplier, 2: Warehouse, 3, 4, 5: Retailer, 6, 7, 8: Customer.

A constant lead time of one time period is considered. Hence, the replenishment order shipped to a certain installation at a certain time period reaches its destination in the next time period. The following


events are defined per time period:

1. The warehouse places its replenishment order to the external supplier.
2. Considering the inventory positions at the warehouse and at the retailers, the warehouse ships the replenishment orders to the retailers.
3. The warehouse and retailers receive their replenishment shipments, shipped in the previous time period, from the external supplier and the warehouse respectively.
4. The demand at the retailers is observed. Excess demand is backlogged, incurring backlogging costs. The warehouse and the retailers incur holding costs for the inventory that they carry to the next time period.

Because Kunnumkal and Topaloglu (2011) did not specify exactly what happens in these time steps, we had to make some assumptions. In the first event, all the stock points place their orders at the upstream actor. The replenishment orders of the retailers are not explicitly mentioned by Kunnumkal and Topaloglu (2011), but we assume that they are placed in this event, as the warehouse also places its order here. The orders are placed simultaneously, so the warehouse does not know what the retailers order at that moment. The order sizes will be decided by the deep reinforcement learning method.

In the second event, the orders are shipped from the external supplier and the warehouse to the receiving stock points. In case the warehouse does not have enough inventory to fulfill all the orders, the orders are fulfilled with respect to the inventory position. How exactly this happens is not described in the paper, but it makes sense to complete the orders one by one, starting from the retailer with the lowest inventory position. If there is still inventory left in the warehouse, this goes to the retailer with the second lowest inventory position, and so on. In general, orders that cannot be fulfilled completely are fulfilled as far as possible. Remaining orders at the warehouse are not backlogged.
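A sketch of this assumed rationing rule (our own illustration of the assumption, not code from the paper):

```python
def ration_shipments(warehouse_stock: int, orders: dict, inventory_position: dict) -> dict:
    """Fulfil retailer orders one by one, starting with the lowest inventory position;
    orders that cannot be fulfilled completely are fulfilled as far as possible."""
    shipments = {r: 0 for r in orders}
    for r in sorted(orders, key=lambda r: inventory_position[r]):
        shipped = min(orders[r], warehouse_stock)
        shipments[r] = shipped
        warehouse_stock -= shipped
    return shipments

# 40 units on hand; retailer 3 has the lowest inventory position and is served first.
print(ration_shipments(40, {1: 20, 2: 25, 3: 30}, {1: 5, 2: 2, 3: -4}))
# {1: 0, 2: 10, 3: 30}
```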

In event three, the stock points receive their shipment from the previous time period. This event uses the same logic as Event 2 of the beer game, which can be found in Algorithm 4.

In the fourth and last event, the demand at the retailers is observed. If the demand is larger than the inventory of the retailer, the remaining order quantity is backlogged.

In their base case, Kunnumkal and Topaloglu (2011) define three identical retailers. This means that the retailers all follow the same demand distribution. The holding and backlogging costs are $hc_{\phi,t} = 0.6$, $hc_{r,t} = 1$ and $bc_{r,t} = 19$ for all r and t. They assume that the demand quantity at retailer r in time period t has a Poisson distribution with mean $\alpha_{r,t}$. This alpha is generated from the uniform distribution on the interval [5, 15] for every time step and every retailer.

The objective is to minimize the total expected holding and backorder costs of the warehouse and retailers. The decision variable is defined as the order quantity that will be decided for every stock point in every time period. Kunnumkal and Topaloglu (2011) formulate the problem as a dynamic program. Because the dynamic program is hard to solve due to its high dimensional state space, they relax the constraints by associating Lagrange multipliers with them. We will not go into further detail of their solution approach, as this is outside the scope of this research. Based on Chaharsooghi et al. (2008), we define the following variables and minimization formula:

$$\text{minimize} \quad \sum_{t=1}^{n} \left( hc_{\phi,t} \cdot h_{\phi,t} + \sum_{r=1}^{3} \left( hc_{r,t} \cdot h_{r,t} + bc_{r,t} \cdot b_{r,t} \right) \right) \qquad (5.1)$$


Where:

$$h_{\phi,t} = \begin{cases} IP_{\phi,t}, & \text{if } IP_{\phi,t} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (5.2)$$

$$h_{r,t} = \begin{cases} IP_{r,t}, & \text{if } IP_{r,t} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (5.3)$$

$$b_{r,t} = \begin{cases} |IP_{r,t}|, & \text{if } IP_{r,t} \leq 0 \\ 0, & \text{otherwise} \end{cases} \qquad (5.4)$$

Every time step, a decision on the order quantity has to be made. This results in the following transition to the next time step:

$$IP_{i,t+1} = IP_{i,t} + O_{i,j,t} - T_{i,j,t} \qquad (5.5)$$

Where $IP_{i,t}$ represents the inventory position of stock point i in time step t, (i = 2, 3, 4, 5)

$O_{i,j,t}$ represents the ordering size of stock point i to the upstream level j, (i = 2, 3, 4, 5, 6, 7, 8; j = 1, 2, 3, 4, 5)

$T_{i,j,t}$ represents the distribution amount of level i to the downstream level j, (i = 1, 2, 3, 4, 5; j = 2, 3, 4, 5, 6, 7, 8)

With this information, we can now classify the divergent inventory system of Kunnumkal and Topaloglu (2011). According to the classification method of De Kok et al. (2018) and Van Santen (2019), we define the following classification:

T1: 2,D,D,G|I,G|P,B|O,F,N|C|ACS,O T2: 1||D,P,N|P,O,N|HB,N

Now that we have classified the divergent inventory system, we will elaborate on the specific aspects of this system, including the state variable, reward function and value function.

5.2 State variable

For the divergent supply chain, we use an adaptation of the state variable that we defined in the deep reinforcement learning section of the beer game. Because we want to pass all the information that is available to the neural network, we define the following state vector:

S = [ti, tb, h_i, b_i, o_i] (5.6)

In this case, ti again defines the total inventory in the system and tb defines the total number of backorders in the system. h_i (i = 2, 3, 4, 5) and b_i (i = 3, 4, 5) define the inventory and backorders per stock point respectively. Because unfulfilled demand at the warehouse does not result in backorders, we only use b_i for the retailers. o_i is the number of items arriving in the next time step for stock point i. As can be seen, this variable no longer depends on t, because we have a fixed lead time of 1. For the same reason, we have removed tr_i from the state vector: with a lead time of 1, the total amount in transit is always equal to the total amount arriving in the next time period.

The values of these state variables are, once again, normalized to the range [−1, 1], to keep the values close together. We want the interval of the state variables before normalization to be wide enough to ensure we do not limit our method in its observations, but also small enough that the neural network regularly observes these states. The lower bound is 0 for every variable, because we do not have negative orders, backorders, or shipments. We have experimented with several values for the upper bound and saw that these values can greatly impact the result of the method. Setting an upper bound that is too large or too small can cause the method to perform up to 50% worse. After our experiments, we used the upper bounds denoted in Table 5.1.

Table 5.1: Upper bound of the state variables for the divergent supply chain.

Variable    Upper bound
ti          1000
tb          450
h_i         250
b_i, o_i    150
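As an illustration, the normalization with these upper bounds could look as follows. The ordering and length of the state vector (ti, tb, four inventories, three backorder values, four in-transit values) is our assumption based on the definitions above, not an exact reproduction of the implementation.

```python
import numpy as np

# Upper bounds from Table 5.1: ti, tb, h_2..h_5, b_3..b_5, o_2..o_5 (lower bounds are 0).
STATE_UPPER = np.array([1000, 450, 250, 250, 250, 250, 150, 150, 150, 150, 150, 150, 150])

def normalize_state(raw_state: np.ndarray) -> np.ndarray:
    """Scale every state variable from [0, upper bound] to [-1, 1]."""
    clipped = np.minimum(raw_state, STATE_UPPER)
    return 2.0 * clipped / STATE_UPPER - 1.0
```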

5.3 Action variable

Just like in the beer game, for every time step, the order quantity has to be decided for every stock point. The demand is still unknown at the moment of ordering. We have to determine the complete size of the order. The action space for this case is formally defined as:

OS = [O_{2,1,S}, O_{3,2,S}, O_{4,2,S}, O_{5,2,S}] (5.7)

In this equation, the first subscripted number denotes the stock point that will order, while the second subscripted number denotes the stock point upstream that will receive the order. The action is state dependent, hence the subscripted S.

For our PPO algorithm, we will once again normalize our action space, and therefore have to define an upper bound for the action space. Because we want to encourage our agent to find the correct actions, and not limit it, we choose an upper bound of 300 for O_{2,1,S} and 75 for the other actions. This should be high enough, without limiting the environment, because the highest mean of the Poisson distribution is 15. According to the Poisson distribution, this mean results in a demand lower than 21 with 95% confidence, so the demand at the warehouse will be (21 ∗ 3 =) 63 in this case. Therefore, we think this upper bound is suitable.

5.4 Reward function

The objective of this case is to minimize the total holding and backorder costs in the supply chain. We continue to use the PPO algorithm and therefore adapt our reward function from the beer game, which results in the following reward function:

$$r(t) = hc_{\phi,t} \cdot h_{\phi,t} + \sum_{r=1}^{3} \left[ hc_{r,t} \cdot h_{r,t} + bc_{r,t} \cdot b_{r,t} \right] \qquad (5.8)$$

In this equation, the holding and backorder costs for every location are included. The inventory costs for the warehouse are $0.6 for every item and it does not have any backorders. The retailers have holding costs of $1 per item and $19 for every backorder.

5.5 Value function

The value function is used to denote the total expected costs over the time horizon. This function is correlated with the reward function. The value function is, once again, based on the value function of Bellman, defined in Equation 3.2:

$$V(s) = \max_{a} \left( r(t) + \gamma\, \mathbb{E}\left[ V(s' \mid s, a) \right] \right) \qquad (5.9)$$

Where r(t) is the reward function of Equation 5.8. Although Kunnumkal and Topaloglu (2011) do not consider discounting, we include discounting in our model. Because we train our deep reinforcement learning method on an infinite horizon, we apply a discount rate of 0.99.

5.6 Adapting the simulation

To use this case in our simulation, several things had to be adjusted. First of all, we had to make sure our simulation executed the events in the sequence that was defined by Kunnumkal and Topaloglu (2011). As mentioned earlier, the paper of Kunnumkal and Topaloglu (2011) did not specify exactly what happens in these events, hence we made some assumptions. To make sure our simulation worked with these events, only the first and second event had to be adjusted, as they were different from the events of the beer game. The adapted pseudo-code of these events can be found in Appendix G.

Next to that, we had to generalize our way of looking at orders and backorders. These can now be placed between any stock points with defined connections, instead of only directly upstream. We also added the compatibility with the Poisson demand distribution. Hence, the environment has become more general. The code of the inventory environment can be found in Appendix A.

In the beer game case, we were able to validate the simulation step by step, using the input and output data. Because we do not have that data for this case, but only the distributions, we cannot validate the simulation model the same way as we did for the beer game. Therefore, we used our visualization to debug the code. We used the visualization to see everything that happens during the events. This way, it is easy to detect inconsistencies in the simulation and to follow the shipments through the supply chain.

The last thing we had to add to the simulation is a warm-up period. Because the paper of Kunnumkal and Topaloglu (2011) did not state an initial inventory position, we decided to implement a warm-up period, to reach a state that is representative of a steady state. In other words, we remove the initialization and look only at the part where the simulation behaves as a steadily running system. Hence, we start with an inventory system with no inventory, backorders, or items in transit. To be safe, we chose a warm-up period of 25 time steps. As mentioned earlier, the horizon of Kunnumkal and Topaloglu (2011) is 50 time periods. Therefore, we simulate for 75 periods, but remove the first 25 time steps from the result. This warm-up period is only used when evaluating the performance of the method. In the learning phase, it is not necessary to use a warm-up period, as the method should learn the value of all states.
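A small sketch of how this warm-up can be handled when evaluating (our own illustration):

```python
WARMUP, HORIZON = 25, 50

def evaluation_costs(costs_per_period: list) -> float:
    """Simulate WARMUP + HORIZON periods and only count the costs after the warm-up."""
    assert len(costs_per_period) == WARMUP + HORIZON  # 75 simulated periods in total
    return sum(costs_per_period[WARMUP:])
```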

5.7 Implementing a benchmark

We were hoping to use the paper of Kunnumkal and Topaloglu (2011) as a benchmark for our method. Unfortunately, when doing some test runs, we quickly saw that the results they mention in their paper, which are around 2700, were impossible for us to achieve. Because their events were not clearly stated and they used a different solution approach, we were not able to reproduce these results, nor to see where this difference comes from. Therefore, we decided to implement another benchmark. As we already discussed in Section 3.3, the most commonly used benchmark is a base-stock policy. A base-stock policy determines the order-up-to level for every stock point. Every time period, the stock point checks whether its inventory position (inventory plus in transit minus backorders) is below the order-up-to level. If it is, it orders the amount needed to reach the order-up-to level.
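A minimal sketch of this order-up-to rule (our own illustration; the example values use the levels found with the DA heuristic later in this section):

```python
def base_stock_order(order_up_to: int, on_hand: int, in_transit: int, backorders: int) -> int:
    """Periodic-review base-stock rule: order up to the target inventory position."""
    inventory_position = on_hand + in_transit - backorders
    return max(0, order_up_to - inventory_position)

print(base_stock_order(order_up_to=124, on_hand=80, in_transit=30, backorders=0))  # 14
print(base_stock_order(order_up_to=30, on_hand=12, in_transit=10, backorders=5))   # 13
```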

60 Chapter 5. A Divergent Supply Chain

In order to determine the base-stock levels for the divergent inventory system, we use the heuristic of Rong et al. (2017). This paper describes two heuristics that are able to successfully determine near-optimal base-stock levels. We will use their decomposition-aggregation (DA) heuristic, as this heuristic is easier to implement, while being only slightly less accurate.

In the decomposition-aggregation heuristic, we initially decompose the divergent network into serial systems. For these decomposed networks, the heuristic of Shang and Song (2003) is used. The heuristic of Shang and Song can only be applied to linear supply chains, but is able to obtain near-optimal local base-stock levels for all locations in the linear system.

The second step of the DA heuristic is aggregation. Backorder matching is used to aggregate the linear supply chains back into the distribution network. The backorder matching procedure sets the local base-stock level of a given location in the divergent network so that its expected backorder level is equal to the sum of the expected backorder levels in all of its counterparts in the decomposed linear systems (Rong et al., 2017).

After using this heuristic, we found the base-stock levels [124, 30, 30, 30]. Hence, the base-stock level for the warehouse should be 124, while the base-stock level for the retailers is 30. The calculations for the DA heuristic can be found in Appendix H. Please note that we have used a lead time of 2 in these calculations instead of the lead time of 1 which was previously mentioned. This is due to the fact that Rong et al. (2017) use a continuous review, while our system uses a periodic review. In this case, demand per lead time is measured with R + L instead of L. As this periodic review is 1 time period, this results in a total demand during lead time of (R + L = 1 + 1 =) 2, which we will use in the heuristic (Silver, Pyke, & Thomas, 2016).

We can now perform our simulation using the base-stock levels that we found with the heuristic. We simulate for 75 periods and remove the first 25 periods as warm-up. This simulation is replicated 1000 times. With these base-stock levels, we obtain average costs of 4059 over the 50 time steps.

5.8 Implementing Deep Reinforcement Learning

In the case of the beer game, we used the deep reinforcement learning method with the PPO algorithm. We saw that the method was able to yield good results and that it was able to use the full state and action size of the problem. In this section, we will apply the deep reinforcement learning method to the divergent case.

To apply the method to a new case, only a few things have to be adjusted, being the previously described state variable, action variable, reward function and value function. We do not change any of the parameters of the PPO algorithm, except for the number of iterations, which is increased to 30.000. This way, one full run of the deep reinforcement learning method takes about 3.5 hours. Once again, we run the simulation every 100 iterations in order to evaluate the performance of the method. Every simulation is repeated 100 times, of which the average result is used. In this simulation, we use the warm-up period, as mentioned earlier. Hence, the simulation is performed for 75 time periods, of which the first 25 periods are removed from the results. In total, the whole method is run 10 times. The result of the deep reinforcement learning method can be seen in Figure 5.2. In this figure, we have removed the outliers, meaning that for every iteration, the highest and lowest scored rewards are removed, so 8 replications remain. The costs that we gained using the heuristic of Rong et al. (2017) are used as benchmark.

Figure 5.2: Results of the PPO algorithm on the divergent supply chain. Lower is better.

As can be seen in this figure, the method is able to correctly learn the needed actions quickly, as the costs decrease sharply over the first iterations. After only a few thousand iterations, which takes about 30 minutes, the method is able to beat the benchmark. The final value of the costs is 3724, compared to 4059 for the benchmark, which is impressive. To gain more insights into the decisions that the deep reinforcement learning method makes, we look at how the actions of the method depend on the state. To do this, we use the trained network and feed different states as input to this network. The output of the network is the action that it will take, hence, the quantity of the order it will place upstream. When we visualize these states and actions in a heatmap, we can see how the actions relate to the states. Because we use a two-dimensional heatmap, we can only vary two variables in the state vector at a time, so the other values in the state vector get a fixed value. This value is based on the average value of the particular state variable, which we determined using a simulation. With this two-dimensional heatmap, it is not possible to quickly see the relations between all the different states, yet we can get insights into some important connections.

In Figure 5.3, we can see such a heatmap. This heatmap shows the relation between the inventory in the warehouse and the number of items in transit to the warehouse. We can see that the order quantity gets lower as the inventory of the warehouse gets higher. The order quantity only slightly depends on the number of items in transit. When the number of items in transit is low, the order quantity is high. This order quantity slightly decreases when the items in transit increase. However, when the number of items in transit passes 50, the order quantity tends to increase again slightly. Although this result might not be as expected, we think this happens because the neural network does not observe these states often. The number of items in transit is equal to the order quantity of the previous period. In the heatmap, we can see that the order quantity is never higher than 65 for these state values. Hence, the number of items in transit will never be higher than 65. It could still be that another combination of state values results in a higher order quantity, but this is a sign that the neural network does not observe these states often. All in all, the heatmap shows that the method was able to make connections between the state and actions that make sense, but it also takes some actions that make less sense.

We decided to also take a look at the retailers and their state-action connections, which can be found in Figure 5.4.


Figure 5.3: Actions of the warehouse.

Figure 5.4: Actions of the retailers.


For these heatmaps, we looked at the different retailers and the relation of their inventory to both the inventory of the warehouse and the number of items in transit. As we can see, the actions barely depend on the inventory of the warehouse. This makes sense, as the number of items in the warehouse does not directly affect the inventory position of the retailer. The connection between the number of items in transit and the inventory of the retailer is clearly visible. The method shows that it does not order when the number of items in transit is large. Also, the current inventory of the retailers impacts the order quantity. Especially for retailers 1 and 2, the order quantity quickly increases when the retailers have backorders. Retailer 3 does not show this quick increase, but still shows some increase.

We can conclude that the deep reinforcement learning method can easily be applied to other cases, without alterations to the algorithm. There might still be some improvement possible by tuning the parameters of the algorithm. However, this is a non-trivial and time-consuming task (Gijsbrechts et al., 2019). Therefore, we decided to keep the current parameters. Even with the current settings, the method is able to beat a base-stock level heuristic. Hence, the main finding is that the deep reinforcement learning method can easily be transferred to other cases, while still yielding good results.

5.9 Conclusion

In this chapter, we have successfully implemented our deep reinforcement learning method on a divergent inventory system from literature, in order to answer sub-question 3b:

3. How can we build a reinforcement learning method to optimize the inventory management at CBC? (b) How can we expand the model to reflect another clearly defined problem from literature?

We used the case described by Kunnumkal and Topaloglu (2011) to define the divergent inventory system. We elaborated on the specific elements of this case in Sections 5.1 to 5.5. We made the corresponding alterations to our existing simulation, in such a way that it is able to represent the case of Kunnumkal and Topaloglu (2011), as described in Section 5.6. In Section 5.7, we used the heuristic of Rong et al. (2017) to determine the base-stock policy levels, which served as a benchmark for our deep reinforcement learning method.

When implementing the deep reinforcement learning method in Section 5.8, we discovered that no alterations to the PPO algorithm are needed to get it to work for the divergent inventory system. However, we did need to define a new state and action vector and noticed that these vectors can have a great impact on the performance of the method: setting an upper bound that is too large or too small can cause the method to perform up to 50% worse. The current values were obtained through several experiments. The method is able to quickly learn the correct actions and even beats the benchmark, which is impressive, considering the near-optimality of base-stock heuristics. When we take a closer look at the actions that the method takes, it is shown that, apart from some exceptions, it is able to learn the relations between certain states and actions. We can conclude that it was easy to apply the PPO algorithm to another case. The state and action variables do have to be adjusted, as they are case specific, and setting these values is not trivial.

In this chapter, we also covered sub-question 4b:

4. What are the insights that we can obtain from our model? (b) How should the performance be evaluated for the second toy problem?


In this case, we have, once again, used a simulation to evaluate the performance of the method. Next to that, we were able to create heatmaps to gain more insights into our method. These heatmaps show the action that the method will take, in relation to two state variables. Because we can only use two variables, it is hard to show how all the variables are related to each other.

Looking back at the classification that was introduced in Chapter 2, we defined the divergent system as follows: T1: 2,D,D,G|I,G|P,B|O,F,N|C|ACS,O T2: 1||D,P,N|P,O,N|HB,N

In this classification, the elements that differ from the case of CBC are, once again, given in bold. Although we have narrowed the gap between the beer game case and the case of CBC, there are still various differences. In the next chapter, we will discuss these differences, as the next chapter describes the implementation of deep reinforcement learning for the case of CBC.


6. The CardBoard Company

In this chapter, we will elaborate on the case of the CardBoard Company. We will first take a look at the real-life case of CBC and make the translation to the simulation in Section 6.1. Section 6.2 covers the state variable of this supply chain, followed by the action variable in Section 6.3. The reward function and corresponding value function are given in Section 6.4 and Section 6.5 respectively. To be able to correctly compare the results of the deep reinforcement learning method, we discuss the benchmark in Section 6.6. Section 6.7 covers the implementation and results of our deep reinforcement learning method. The practical implications of the method are discussed in Section 6.8. The conclusion can be found in Section 6.9.

6.1 Case description

In Chapter 2, we have introduced the real-life case of CBC. We have gathered all the relevant information of the system and classified it according to the typology of De Kok et al. (2018) and Van Santen (2019). In this section, we will take a closer look at this classification and determine which elements we are able to implement in our deep reinforcement learning method. The classification of CBC is as follows, where the letters in bold indicate that this feature has not yet been implemented in the method:

T1: 2,G,D,G|F,C|C,B|s,O,R|SC|ACS,O T2: n||M,P,I|P,O,N|N,B

In the next subsections, we elaborate on the most important features of the CardBoard Company and we discuss how the gap between the case of CBC and the simulation will be closed. For every feature, we decide if this will be implemented in the simulation.

Products

CBC produces various types of cardboard. As mentioned in Chapter 2, in total, 281 unique products can be produced. Not every plant can produce every product type and there are types that can be produced in multiple corrugated plants, which results in a total of 415 unique product type – corrugated plant combinations.

For our simulation, we decided to only look at one product type at a time. If the simulation were to include multiple products, this would heavily impact the number of actions, as the order policy is determined for every product. This would, therefore, also impact the running time of our deep reinforcement learning method. To determine the order policy for multiple products, the method can be run multiple times.

However, this simplification also impacts other aspects of the CBC case. We mentioned earlier that every stock point in the supply chain has a bounded capacity. However, because this capacity is large, and, therefore, only interesting when looking at multiple products, we consider the bounded capacity out of scope.

When a certain product type is not available, customers of CBC can opt for substitution of the product. This way, the customer can choose another, more expensive product to be delivered instead. Because substitution is only possible when multiple products are considered in the simulation, we also consider substitution to be out of scope. Hence, products that are not available will not be substituted by another product but will be backlogged instead.

66 Chapter 6. The CardBoard Company


Where 0, 1, 2, 3: Supplier, 4, 5, 6, 7: Paper Mill, 8, 9, 10, 11, 12: Corrugated Plant, 13, 14, 15, 16, 17: Customer

Figure 6.1: An overview of the supply chain of CBC, along with the probabilities of the connections between the paper mills and corrugated plants.

Connections

The CardBoard Company has an inventory system with multiple connections between the stock points and is therefore classified as a general inventory system. In our simulation model, we add the possibility of multiple connections per corrugated plant. There are two options for handling the multiple connections at which a corrugated plant can place its orders.

The first option is to let the deep reinforcement learning method decide at which upstream stock point the plant will place its order. This way, we have to define an action for every possible connection, which would result in 18 actions in the case of CBC. The second option, which is also denoted as relative rationing, is to assign a certain probability to every connection. This probability represents the chance that the plant will order at this connection. This way, we have to define an action for every stock point, which results in 9 actions, and the probabilities of the connections should sum up to 1 for every plant.

Relative rationing does add extra randomness to the environment, which could be difficult to learn for the deep reinforcement learning method. However, because we want to limit the number of actions that the deep reinforcement learning method has to learn, we decided to implement relative rationing. Relative rationing is also used by the CardBoard Company itself when running simulations and evaluations of their supply chain. Figure 6.1 shows the paper mills and corrugated plants of CBC, along with the probabilities of the connections.
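A sketch of how relative rationing can be implemented in the simulation; the plant number and probabilities below are hypothetical and do not correspond to the exact values of Figure 6.1.

```python
import random

# Hypothetical connection probabilities: corrugated plant -> {paper mill: probability}.
CONNECTIONS = {9: {4: 0.1, 5: 0.5, 6: 0.4}}

def choose_supplier(plant: int, rng: random.Random) -> int:
    """Relative rationing: sample the paper mill that receives this plant's order."""
    mills, probs = zip(*CONNECTIONS[plant].items())
    return rng.choices(mills, weights=probs, k=1)[0]

print(choose_supplier(9, random.Random(0)))  # samples one of the mills 4, 5 or 6
```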

In Chapter 2, we mentioned that we assume a constant lead time, as the observed lead times did not vary by more than 30 minutes on total lead times of multiple hours. In real life, this lead time is different for every connection, but, for simplicity, we will assume the lead time to be equal for every connection. We will assume a lead time of 1 time period.

The last simplification concerns the routing flexibility. It is possible that a product type is not in stock at a certain corrugated plant but is available at another plant. In the real-life situation, the product can then be delivered from one corrugated plant to the other, which is described as routing flexibility. These connections only exist between a few plants that are located close to each other. In our simulation, we will not take this routing flexibility into account.

Demand

The historic demand of the CardBoard Company from last year was extracted from the ERP system. However, when taking a closer look at the demand, we noticed that it was lumpy demand. This means that the variability in both the demand timing and demand quantity is high, and, therefore, hard to predict. For simplicity, we decided to use a Poisson distribution with λ = 15 to generate the demand. In the real-world situation of CBC, paper mills deliver not only to the corrugated plants, but also have other customers that can order directly at the paper mills, which is called intermediate demand. For now, we consider this intermediate demand to be out of scope, so customer demand can only occur at the plants.

As mentioned in Chapter 2, CBC considers a Minimum Order Quantity, hence, if an order is placed, it should at least be higher than the MOQ. In our simulation, we do not consider a MOQ.

Inventory rationing has to be used whenever two or more orders can only be partially fulfilled. In our simulation, we will, just like in the previous case, look at the inventory position of the corrugated plants. The plant with the lowest inventory position will be fulfilled first. We do not need inventory rationing for the customers, as we only generate one order per time period.

Objective

In Chapter 2, we mentioned that CBC wants to minimize their costs, while obtaining a target service level of 98%. The real holding and backorder costs of CBC are unknown, and, therefore, we will have to define our own costs. We will use the same holding and backorder costs as defined by Kunnumkal and Topaloglu (2011) in the divergent supply chain. Hence, the holding costs ($hc_{r,t}$) will be $0.6 for the paper mills and $1 for the corrugated plants. The backorder costs ($bc_{r,t}$) will be $19 for the corrugated plants. There are no backorder costs for the paper mills. For now, we will not impose the target service level, but rather focus on the minimization of costs.

The decision variable is defined as the order quantity that will be decided for every stock point in every time period. Based on the two previous cases, we define the following variables and minimization formula:

$$\text{minimize} \quad \sum_{t=1}^{n} \sum_{r=4}^{12} \left[ hc_{r,t} \cdot h_{r,t} + bc_{r,t} \cdot b_{r,t} \right] \qquad (6.1)$$

Where:

$$h_{r,t} = \begin{cases} IP_{r,t}, & \text{if } IP_{r,t} > 0 \\ 0, & \text{otherwise} \end{cases} \qquad (6.2)$$

$$b_{r,t} = \begin{cases} |IP_{r,t}|, & \text{if } IP_{r,t} \leq 0 \\ 0, & \text{otherwise} \end{cases} \qquad (6.3)$$

68 Chapter 6. The CardBoard Company

Every time step, a decision on the order quantity has to be made. This results in the following transition to the next time step:

$$IP_{i,t+1} = IP_{i,t} + O_{i,j,t} - T_{i,j,t} \qquad (6.4)$$

Where $IP_{i,t}$ represents the inventory position of stock point i in time step t, (i = 4, ..., 12)

$O_{i,j,t}$ represents the ordering size of stock point i to the upstream level j, (i = 4, ..., 12; j = 0, ..., 7)

$T_{i,j,t}$ represents the distribution amount of level i to the downstream level j, (i = 0, ..., 7; j = 4, ..., 12)

n represents the planning horizon of 50 time periods. In these time periods, we will use the same order of events as listed in the beer game case of Chaharsooghi et al. (2008).

New classification

With these assumptions and simplifications, we end up with the following classification. This classification denotes all the features that will be implemented in the simulation. The bold values indicate the differences with the real-life case of CBC. T1: 2,G,D,G|I,C|P,B|O,F,N|C|ACS,O T2: 1||D,P,N|P,O,N|HB,N

As can be seen, there are still quite some features that differ from the situation of CBC. We think it is important to experiment with a simplified general network first, to see if the deep reinforcement learning method is able to learn. If this is the case, future research could focus on extending the simulation in such a way that all features of the case of CBC can be covered.

6.2 State variable

To define the state variable for CBC, we once again use an adaptation of the state variables that we defined in the previous cases. Because we want to pass all the relevant information that is available to the neural network, we define the following state vector:

S = [ti, tb, h_i, b_{ij}, o_{ij}] (6.5)

To adapt the state vector to a general inventory system, we changed the last two variables in this vector. These variables are now denoted per connection, instead of per stock point. Hence, the backorders and the number of items arriving in the next time step depend on both i and j, where i is the source of the connection and j is the destination.

The values of these state variables are, as opposed to the previous cases, normalized to the range [0, 1]. We changed this interval because we want the output of an untrained network in the first state to have a mean of 0. When starting with a mean of 0, we make sure that the actions do not have to be clipped in the beginning, as the actions can lie in the interval [−1, 1]. In the previous sections, we arbitrarily determined the upper bounds for the state and action vectors. Because we have concluded that these upper bound values have a great impact on the performance of the method, we have decided to perform experiments to determine these upper bounds. In Section 6.7, we will elaborate on these experiments.

6.3 Action variable

The deep reinforcement learning method will decide on the order quantity for every stock point. The demand is still unknown at the moment of ordering and we have to determine the complete size of the order. The action space for this case is formally defined as:

OS = [O_{4,0,S}, O_{5,1,S}, O_{6,2,S}, O_{7,3,S}, O_{8,j,S}, O_{9,j,S}, O_{10,j,S}, O_{11,j,S}, O_{12,j,S}] (6.6)

In this equation, the first subscripted number denotes the stock point that will place the order, while the second subscripted number denotes the upstream stock point that will receive the order. If a j is used, this means that the upstream actor can vary, according to the probabilities of the connections, as given in Figure 6.1.

We will once again normalize our action space to the interval of [−1, 1], and therefore have to define an upper bound for the action space. We will run experiments in Section 6.7 to define suitable values for the upper bound.

6.4 Reward function

The objective of this case is to minimize the total holding and backorder costs in the supply chain. The reward function is based on the two previous cases and denoted as follows:

$$r(t) = \sum_{r=4}^{12} \left[ hc_{r,t} \cdot h_{r,t} + bc_{r,t} \cdot b_{r,t} \right] \qquad (6.7)$$

In this equation, the holding and backorder costs for every location are included. The holding costs for the paper mills are $0.6 for every item and their backorder costs are zero. The corrugated plants have holding costs of $1 per item and backorder costs of $19 for every backorder.

6.5 Value function

The value function is used to denote the total expected costs over the time horizon. This function is correlated with the reward function. The value function is, once again, based on the value function of Bellman, defined in Equation 3.2:

$$V(s) = \max_{a} \left( r(t) + \gamma\, \mathbb{E}\left[ V(s' \mid s, a) \right] \right) \qquad (6.8)$$

Where r(t) is the reward function of Equation 6.7. Just like in the previous cases, our gamma is set to 0.99. The value function will be estimated by the PPO algorithm.

6.6 Benchmark

To compare the result of our deep reinforcement learning method, we are looking for a suitable benchmark. For the divergent supply chain, we were able to implement the heuristic of Rong et al. (2017) that computes the parameters for a base-stock policy. However, this heuristic is only suitable for divergent supply chains, so we are not able to use this benchmark for the case of CBC.

We do, however, have a benchmark from the real-life case of CBC that we can reconstruct. As mentioned in Chapter 2, the CardBoard Company tries to achieve a fill rate of 98% for their customers. They want to achieve this by also enforcing a fill rate of 98% for the connections between the paper mills and corrugated plants. With this information, we are able to set up a benchmark. We do this by manually tuning the base-stock parameters in such a way that the fill rate of 98% is met for every connection. We measure this fill rate by running the simulation for 50 time periods and replicating this simulation 500 times. With this method, we obtain the base-stock parameters [82, 100, 64, 83, 35, 35, 35, 35, 35], which result in average total costs of 10467.
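The sketch below illustrates how such a fill rate can be measured in one simulation run; it is our own illustration, and the demand and shipment values are random stand-ins rather than CBC data.

```python
import numpy as np

def fill_rate(demand: np.ndarray, shipped_from_stock: np.ndarray) -> float:
    """Fraction of demanded units that is delivered directly from stock."""
    return shipped_from_stock.sum() / demand.sum()

rng = np.random.default_rng(0)
demand = rng.poisson(15, size=50)                        # one run of 50 periods
shipped = np.minimum(demand, rng.poisson(15, size=50))   # stand-in for what was on stock
print(round(fill_rate(demand, shipped), 3))
# Benchmark procedure: raise a connection's base-stock level while its average
# fill rate over 500 replications stays below 0.98.
```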

70 Chapter 6. The CardBoard Company

6.7 Implementing Deep Reinforcement Learning

In the two previous cases, we used deep reinforcement learning with the PPO algorithm. The method was able to yield good results with this algorithm. However, we did notice that the performance of the algorithm was heavily dependent on the definition of the state and action vectors. In this section, we will apply our method to the case of the CardBoard Company. We will first define several experiments to determine the most suitable values for the upper bounds of the state and action vectors. When we have determined these values, we can run the experiments to measure the performance of the method.

In the previous chapter, we have concluded that a good definition of both the state and action vector is critical for a correctly performing method. Because we have already determined base-stock levels in the previous section, we have an indication for the values of the action vector. We define three possible value combinations that we will experiment with: [300, 75], [150, 75] and [150, 50]. The first value denotes the upper bound of the actions for the paper mills, while the last value denotes the upper bound for the corrugated plants.

The upper bound values of the state vector should be set such that they are high enough to ensure we do not limit our method in its observations, but also small enough that the neural network regularly observes these states. With the use of simulations, we have defined three possible combinations for the state values: [1000, 1000, 500], [500, 500, 250] and [250, 250, 150]. In these combinations, the first value denotes the upper bound of the inventory for all stock points. The second value concerns the backorders for the paper mills, while the last value concerns the backorders for the corrugated plants. The upper bounds for the total inventory and total backorders are defined by the sum of all upper bounds for the inventory and backorders respectively. The upper bounds for the items in transit are dependent on the action values. For now, we have not considered specific values for individual connections, yet this might be interesting to take into account in future experiments. Table 6.1 defines the unique experiments that will be performed.

Table 6.1: Experiments for defining the upper bounds of the state and action vector for the CBC case.

Experiment   Action space   State space
1            [300, 75]      [1000, 1000, 500]
2            [300, 75]      [500, 500, 250]
3            [300, 75]      [250, 250, 150]
4            [150, 75]      [1000, 1000, 500]
5            [150, 75]      [500, 500, 250]
6            [150, 75]      [250, 250, 150]
7            [150, 50]      [1000, 1000, 500]
8            [150, 50]      [500, 500, 250]
9            [150, 50]      [250, 250, 150]
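As announced above, the sketch below assembles the upper-bound array of the state vector for one experiment. The ordering of the state entries and the helper name build_state_high are our own; the composition (per-stock-point inventories and backorders, their totals, and the items in transit bounded by the action values) follows the description above, with four paper mills and five corrugated plants as in the CBC network.

import numpy as np

def build_state_high(inv_ub, bo_mill_ub, bo_plant_ub, action_ub,
                     n_mills=4, n_plants=5):
    """Assemble the upper bounds of the state vector for one experiment.

    inv_ub:      upper bound on inventory, identical for every stock point
    bo_mill_ub:  upper bound on backorders at the paper mills
    bo_plant_ub: upper bound on backorders at the corrugated plants
    action_ub:   (mill_order_ub, plant_order_ub), bounding the items in transit
    """
    n = n_mills + n_plants
    inventory = [inv_ub] * n
    backorders = [bo_mill_ub] * n_mills + [bo_plant_ub] * n_plants
    totals = [sum(inventory), sum(backorders)]   # total inventory, total backorders
    in_transit = [action_ub[0]] * n_mills + [action_ub[1]] * n_plants
    return np.array(inventory + backorders + totals + in_transit)

# Experiment 5: action space [150, 75], state space [500, 500, 250]
state_high = build_state_high(500, 500, 250, (150, 75))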

We did not change any parameters of the PPO algorithm, apart from the new scaling mentioned in Section 6.2. We run the deep reinforcement learning method for 30,000 iterations, which now takes about 4.2 hours because of the increased output layer. This time, we run the simulation every 500 iterations to evaluate the performance. Every simulation takes 100 time steps, of which the first 50 are removed as warm-up period. The simulation is repeated 100 times and the average is used. In total, we run every experiment for two replications. This way, we can quickly determine which setting performs best. The results are visualized in Figure 6.2.
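The sketch below summarises this training-and-evaluation loop. The train_for and run_simulation calls are hypothetical stand-ins for the PPO update and the simulation runs; only the counts (evaluation every 500 iterations, 100 time steps with a 50-step warm-up, 100 replications) follow the text above.

def evaluate_training(agent, total_iterations=30_000, eval_every=500,
                      sim_steps=100, warmup=50, sim_replications=100):
    """Interleave PPO training with periodic simulation-based evaluation."""
    history = []
    for _ in range(0, total_iterations, eval_every):
        agent.train_for(eval_every)                        # hypothetical training call
        costs = [agent.run_simulation(sim_steps, warmup)   # average costs after warm-up
                 for _ in range(sim_replications)]
        history.append(sum(costs) / len(costs))
    return history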


[Nine panels, one per experiment (Experiments 1-9), each plotting costs (x10^4) against training iteration (x100) for Run 1, Run 2 and the benchmark.]

Figure 6.2: Results of the PPO algorithm per experiment. Lower is better.


As can be seen, the results vary greatly between the different experiments, but also between the different runs. Sometimes the method performs so poorly that the results fall outside the scope of the figure. Four experiments were able to approximate the benchmark, namely Experiments 4, 5, 7 and 8. Experiment 5 performs best, as it yields steady results and still improves over the iterations. We therefore chose to continue with Experiment 5 and completed 10 runs in total with these settings. Figure 6.3 shows the performance of Experiment 5 over these 10 runs. As can be seen, the results of these runs vary considerably. Five out of ten runs perform well and lie close to the benchmark. The other five runs yield poor results and are not able to learn correct order quantities. For the remaining analysis, we first focus on the performance of the runs that learn correctly. Thereafter, we gain more insights into the runs that do not learn.

[Costs (x10^2) against training iteration (x100) for Runs 1-10 and the benchmark.]

Figure 6.3: Results of the PPO algorithm on the case of CBC. Lower is better.

[Costs (x10^2) against training iteration (x100) for Runs 1, 2, 4, 6 and 7 and the benchmark.]

Figure 6.4: Results of the PPO algorithm on the case of CBC of the 5 best runs. Lower is better.


Figure 6.4 zooms in on the five runs that are learning correctly. Although the results of these runs still vary, we can see that all of them approach the benchmark and sometimes even score better than it. In the divergent case, we created heatmaps to gain more insight into the actions of the method. For the case of the CardBoard Company, we cannot create these heatmaps, as there are too many state variables that depend on each other. To gain more insight into the origins of the rewards, and how these compare to the benchmark, we visualize the costs per time step. Next to that, we compare the fill rates achieved by the benchmark and by our deep reinforcement learning method.

For every run of our deep reinforcement learning method, we use the fully trained network to run a simulation for 150 periods. We then remove the first and last 50 periods, to account for the warm-up period and possible end effects. Figure 6.5 shows the results of Run 1. As we can see, this run scores significantly better than the benchmark. The holding costs of our method are somewhat lower than those of the benchmark, but the biggest difference can be found in the backorder costs. Because both the holding and backorder costs of our method are lower than those of the benchmark for most time steps, this shows that the method has distributed its inventory better over the different stock points.

[Holding/backorder costs (x10^2) per time step for the benchmark and Run 1; the total costs over these time steps are 11,631 for the benchmark and 8,421 for Run 1.]

Figure 6.5: Costs over time for the case of CBC. Closer to zero is better.
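The per-period comparison can be produced with a short evaluation run. The sketch below assumes the step interface of the environment in Appendix A (which returns the state, the negative scaled costs, and a done flag) and a policy(state) wrapper around the trained network; both names are ours.

def costs_per_time_step(env, policy, periods=150, cut=50):
    """Run one simulation with the trained policy and keep the costs of the
    middle periods (first 50 = warm-up, last 50 = possible end effects)."""
    state = env.reset()
    costs = []
    for _ in range(periods):
        state, reward, _ = env.step(policy(state))
        costs.append(-reward)  # the environment returns the negative (scaled) costs
    return costs[cut:periods - cut]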

Next to that, we also want insight into the average fill rate that the method achieves. We have calculated the fill rate using the same simulation. Figure 6.6 shows the average fill rate of Run 1 for both the benchmark and our deep reinforcement learning method. While we did not impose the 98% fill rate as a constraint in our model, the method still meets this requirement for almost every corrugated plant. We can see that the method has learned that it is not necessary to achieve a 98% fill rate for every stock point, but only for the corrugated plants, as they deliver directly to the customer. Overall, we can say that the method is able to yield good results, but not for every run. The method is, therefore, unstable and not yet fit for use in a real-life situation. A sketch of how the fill rate can be computed is given after Figure 6.6.

[Bar chart: average fill rate per stock point (Keulen, Linz, Granada, Umeå, Aken, Erfstadt, Göttingen, Valencia, Malaga) for the benchmark and Run 1.]

Figure 6.6: Average fill rates for the stock points of CBC.
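The fill rate itself is simply the fraction of demanded units that could be served directly from stock. A minimal sketch is shown below; feeding it with the TotalFulfilled and TotalDemand counters of the environment in Appendix A is our assumption of how the numbers in Figure 6.6 could be reproduced.

import numpy as np

def average_fill_rate(fulfilled, demanded):
    """Element-wise fill rate: fraction of the demanded units served from stock.

    Entries without any demand are returned as NaN instead of dividing by zero."""
    fulfilled = np.asarray(fulfilled, dtype=float)
    demanded = np.asarray(demanded, dtype=float)
    return np.divide(fulfilled, demanded,
                     out=np.full_like(demanded, np.nan), where=demanded > 0)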


It is hard to say why the method does not perform well in every run. However, if we look at the actions that the method takes in the different runs, we see a clear difference between runs that yield good results and runs that do not. As mentioned in Section 6.3, actions are scaled to the interval [−1, 1]. If the method takes actions outside this interval, these actions are clipped. Hence, choosing either -1 or -3 results in the same action. Ideally, the method learns this interval and chooses actions within these bounds. However, we noticed that this was not the case for every run. Figure 6.7 shows the distribution of the scaled actions taken by the deep reinforcement learning method for Runs 1 and 10. For Run 1, the method learns to place its actions in the interval [−1, 1]. For Run 10, the method does not place its actions inside this interval. This also holds for all other runs that do not perform well. Future research could, therefore, look into these actions and experiment with different intervals. A sketch of the clipping and rescaling step is given after Figure 6.7.

[Two panels showing the scaled actions against the training iteration: (a) Run 1 and (b) Run 10.]

Figure 6.7: Distribution of the scaled actions.
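The clipping and rescaling step itself is small. The sketch below mirrors the _check_action_space logic of the environment in Appendix A; the bounds of 0 and 150 are the illustrative order bounds of a paper mill in Experiment 5.

import numpy as np

def scale_action(raw_action, low=-1.0, high=1.0, order_min=0, order_max=150):
    """Clip a network output to [low, high] and map it linearly to an order quantity.

    Any output outside [-1, 1] is clipped first, so -1 and -3 lead to the same
    order quantity of order_min."""
    clipped = np.clip(np.asarray(raw_action, dtype=float), low, high)
    orders = (clipped - low) / (high - low) * (order_max - order_min) + order_min
    return np.round(orders)

scale_action([-3.0, -1.0, 0.0, 2.5])  # -> [0., 0., 75., 150.]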


6.8 Practical implications

Now that we have implemented the deep reinforcement learning method on the case of the CardBoard Company, we can elaborate on the practical implications of this method. We discuss how the method can be implemented at CBC and how ORTEC can use it.

In Chapter 4, we explained that we have implemented the deep reinforcement learning method on an infinite horizon. Hence, our method proposes a policy that is not time dependent, but only depends on the state. This results in a policy that is comparable to a base-stock policy, as the base-stock policy also depends on the state of the stock point. When we have run the method, we end up with a trained neural network. This neural network can directly be used by CBC. Whenever CBC wants to place its orders, it can feed the current state of the inventory system into the neural network, which then outputs the order quantities it has learned during training. However, determining the current state of the inventory system can be difficult. Therefore, it could be interesting to link the neural network to the ERP system of CBC, such that the neural network can automatically extract the needed information about the current state.
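As an illustration of this ordering step, the sketch below feeds a state vector to a trained actor network and translates the output into order quantities. It assumes a PyTorch actor that outputs the (deterministic) mean action and reuses the scaling of Section 6.7; extracting the state from the ERP system is left to a separate integration step, and all names are ours.

import numpy as np
import torch

def order_quantities(actor, state, low=-1.0, high=1.0, order_min=0, order_max=150):
    """Return the order quantities proposed by the trained actor for one state."""
    with torch.no_grad():
        raw = actor(torch.as_tensor(state, dtype=torch.float32)).numpy()
    clipped = np.clip(raw, low, high)
    orders = (clipped - low) / (high - low) * (order_max - order_min) + order_min
    return np.round(orders).astype(int)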

In order to apply the deep reinforcement learning method to other customers, two aspects should be considered. First, it is important that all relevant information about the customer's inventory system is gathered. This can be done using the classification of De Kok et al. (2018) and Van Santen (2019). When every feature of this classification is covered, we have enough information to translate the real-life situation into the simulation. The second important aspect is to tune the state and action variables. This can be done by defining several experiments, just like we did in the previous section. Once the experiments have determined the most suitable state and action variables, the method can be run and the network can be trained. This network can then, as described in the previous paragraph, be used to obtain fitting order quantities for every state.

The method is built in an iterative way. We started by applying the deep reinforcement learning method to two toy problems and thereafter applied it to the case of CBC. This way, we can ensure the generalizability of the method. However, we have not yet implemented logic for all elements of the classification. Therefore, the simulation may first have to be extended before it can reflect the situation of a particular inventory system.

6.9 Conclusion

In this chapter, we have implemented our deep reinforcement learning method on the third and final case: the case of the CardBoard Company. With this implementation, there are several research questions that we can answer.

3. How can we build a reinforcement learning method to optimize the inventory management at CBC?
(c) How can we expand the model to reflect the situation of CBC?

In Section 6.1, we have reviewed the classification of the CardBoard Company and covered the aspects that were not yet implemented in the simulation. We have made several assumptions and simplifications in order to determine the scope of our simulation. Eventually, we have implemented the following inventory system in our simulation, according to the classification of De Kok et al. (2018) and Van Santen (2019):

T1: 2,G,D,G|I,C|P,B|O,F,N|C|ACS,O
T2: 1||D,P,N|P,O,N|HB,N


In this classification, the bold values denote the differences with the real-life situation of the CardBoard Company.

Next to that, we had to determine suitable action and state variables. This is done by running several experiments using predefined action and state intervals. With these experiments, we have managed to find a setting that yields good results. With this setting, we have performed more experiments in order to answer the following research question:

4. What are the insights that we can obtain from our model?
(c) How should the performance be evaluated for CBC?
(d) How well does the method perform compared to the current method and other relevant methods?

In Section 6.6, we concluded that heuristics for general inventory systems are scarce and decided to use the current method of CBC as a benchmark. Because we made several assumptions in the simulation, we had to reconstruct this method by setting the base-stock parameters such that all connections satisfy a 98% fill rate. We know that this method is not optimal, but it can still be used as a fair benchmark.

We ran the final experiments for the deep reinforcement learning method in Section 6.7. We evaluated the performance every 500 iterations by running a simulation of 50 time periods. We were not able to create heatmaps for the case of CBC, as there are too many state variables that depend on each other. Instead, we ran a simulation with our method and the benchmark to see where the costs per time step come from, and we calculated the average fill rate per stock point. However, we also noticed that the performance of the method differs heavily per run. Moreover, we noticed that some runs do not seem to learn at all: these runs start with bad results and do not improve over time. We took a closer look at the actions of the method and noticed that it is not able to learn the correct interval of the scaled actions. Therefore, we think that future research should focus on experimenting with this interval.

In Section 6.8, we have elaborated on the practical implications, to answer the following research question:

5. How can ORTEC use this reinforcement learning method?
(a) How can the new method be implemented at CBC?
(b) How can the reinforcement learning method be generalized to be used for inventory systems of other customers?

With the deep reinforcement learning method, we have trained a neural network. This neural network can directly be used by CBC. Whenever CBC wants to place its orders, it can feed the current state of the inventory system into the neural network, which then outputs the order quantities it has learned during training.

The method has already been applied to three cases and has been built in an iterative way. This way, we can ensure the generalizability of the method. If ORTEC wants to use the deep reinforcement learning method for other customers, two aspects are important:

• The inventory system should be fully classified. This way, ORTEC can make sure that the inventory system can be implemented in the simulation.
• The state and action variables should be defined for every unique inventory system and have to be tuned.


7. Conclusion and Recommendations

In this chapter, we discuss the most important findings and implications of this research. In Section 7.1, we discuss the most important findings, the contributions to literature, and the current limitations. Section 7.2 covers the recommendations for ORTEC. Suggestions for further research are given in Section 7.3.

7.1 Conclusion

This research was initiated by ORTEC. ORTEC has experience with traditional solution methods for multi-echelon inventory management, but is interested in the applicability of reinforcement learning to this subject. A suitable case for this research is the case of the CardBoard Company. This company has a multi-echelon inventory system and is looking to minimize its holding and backorder costs. Therefore, we formulated the following research question:

In what way, and to what degree, can a reinforcement learning method be best applied to the multi-echelon inventory system of the CardBoard Company, and how can this model be generalized?

To answer this research question, we started by looking into the current situation of the CardBoard Company. We described all relevant aspects of the supply chain and denoted them using the classification concept of De Kok et al. (2018) and Van Santen (2019), with which we were able to capture all real-life elements.

We have built a reinforcement learning method in an iterative way. We started with two toy problems: the beer game and a divergent supply chain of Kunnumkal and Topaloglu (2011). For the beer game, we reconstructed the simulation and the Q-learning method from the paper of Chaharsooghi et al. (2008). We concluded that the method of Chaharsooghi et al. (2008) was not learning the correct values of the states, but only obtained its results through random actions. This was possible because they defined a small action space, in which the impact of the chosen actions was limited. We also concluded that the tabular Q-learning method was not scalable, as it already reached the memory limits of our computer. Therefore, we decided to implement a deep reinforcement learning method with the Proximal Policy Optimization algorithm of Schulman et al. (2017). This method uses a neural network to estimate the value function and can, therefore, also be used on larger problems. We performed several experiments with the deep reinforcement learning method and concluded that it was able to learn, as it yields better results than the method of Chaharsooghi et al. (2008).

Because of the successful implementation of the deep reinforcement learning method on the beer game, we also applied this method to a divergent inventory system, defined by Kunnumkal and Topaloglu (2011). We were able to apply the method without modifying any of the parameters of the algorithm. We did notice that the definitions of the state and action vectors were important for the performance of the algorithm. In order to compare our results, we implemented the heuristic of Rong et al. (2017), which determines near-optimal base-stock parameters for divergent supply chains. After running several experiments, we see that our deep reinforcement learning method is able to beat this benchmark after only 30,000 iterations, which takes about 30 minutes.

After two successful implementations of our deep reinforcement learning method, we applied it to a simplified case of the CardBoard Company. As a benchmark, we reconstructed the current method of CBC by determining the base-stock parameters such that a fill rate of 98% is achieved for every connection. As we concluded that the state and action vectors are critical for the performance of the method, we defined nine experiments that vary in their action and state spaces. Of these nine experiments, we chose one setting to experiment on further. With this setting, we see that the method succeeds in yielding better results than the benchmark for some runs. However, the method turns out to be unstable and is not able to learn the correct actions in five out of ten runs. We looked into the actions of the runs that do not perform well and noticed that the method is not able to learn the interval of the action space. Therefore, we believe that more experiments regarding the action space are necessary and can still improve the results. The final results of the deep reinforcement learning method on the three cases can be found in Table 7.1. For the case of the CardBoard Company, we report both the best and the worst run, as these are far apart.

Table 7.1: Total costs of the DRL method and the corresponding benchmarks. Lower is better.

Case        DRL                 Benchmark
Beer game   2,726               3,259
Divergent   3,724               4,059
CBC         8,402 - 1,252,400   10,467

We have compared the deep reinforcement learning method to various benchmarks. For the beer game, we compared our result with the result of the paper. For the divergent case, we implemented the base-stock heuristic of Rong et al. (2017). Heuristics for determining base-stock parameters in general inventory systems are hard to find, which is why we decided to compare our method with the reconstructed current method of CBC. We evaluated the performance using a simulation and compared the total costs incurred over the horizon.

We have used two different methods to gain more insight into the decisions of the neural network. First, we applied the neural network alongside a benchmark to the same simulation instance to compare their performance side by side. This way, we were able to see the actions that the method took at every time step and the corresponding costs. Moreover, we created a heatmap that shows the order quantity as a function of two state variables. This way, we can see the dynamics of the action and judge whether these actions seem logical.
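Such a heatmap can be generated by sweeping two state variables over a grid while all other state entries are kept fixed. The sketch below is our own reconstruction of that idea; policy maps a state vector to order quantities, and the indices and value ranges are case specific.

import numpy as np
import matplotlib.pyplot as plt

def order_heatmap(policy, base_state, idx_x, idx_y, values_x, values_y, stockpoint=0):
    """Order quantity chosen by the trained policy while two state variables vary."""
    grid = np.zeros((len(values_y), len(values_x)))
    for i, y in enumerate(values_y):
        for j, x in enumerate(values_x):
            state = np.array(base_state, dtype=float)
            state[idx_x], state[idx_y] = x, y
            grid[i, j] = policy(state)[stockpoint]
    plt.imshow(grid, origin='lower', aspect='auto',
               extent=[values_x[0], values_x[-1], values_y[0], values_y[-1]])
    plt.colorbar(label='order quantity')
    plt.xlabel(f'state variable {idx_x}')
    plt.ylabel(f'state variable {idx_y}')
    return grid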

Our deep reinforcement learning method can be implemented at CBC by using the trained neural network. Whenever CBC wants to place its orders, it can feed the current state of the inventory system into the neural network, which then outputs the order quantities it has learned during training. However, determining the current state of the inventory system can be difficult. Therefore, it could be interesting to link the neural network to the ERP system of CBC, such that the neural network can automatically extract the needed information about the current state.

In order to apply the deep reinforcement learning method to other customers, ORTEC should take two things into account. First, it is important that all relevant information about the customer's inventory system is gathered; the classification of De Kok et al. (2018) and Van Santen (2019) can be used to ensure that all relevant information is captured. The second important aspect is to tune the state and action variables for every specific case. Once the most suitable state and action variables are found, the method can be run and the network can be trained. This network can then be used to obtain fitting order quantities for every state.

7.1.1 Scientific contribution

This research contributes to the fields of deep reinforcement learning and inventory management in three ways. To the best of our knowledge, this research is the first to successfully apply a neural network with a continuous action space to a multi-echelon inventory system. With the use of the Proximal Policy Optimization algorithm, we are able to beat the benchmark on both a linear and a divergent system. To accomplish this, we have created a general environment in which various supply chains can be modeled and solved. This environment can be used in future research and extended to simulate even more important aspects of inventory systems. Finally, we have applied deep reinforcement learning to a general inventory system, a case that has not been considered before in deep reinforcement learning. However, it must be noted that the performance of the method still varies and needs to be improved before it can actually be used as a solution method.

7.1.2 Limitations

Despite the promising results, this research and our method come with limitations. We discuss the limitations of the data and of the method used in this research.

The foremost limitation of this research is the difficulty of comparing the performance of the deep reinforcement learning method with the actual performance of the CardBoard Company. This limitation is self-imposed, as we decided to make simplifications to the case of CBC. These simplifications made the case easier to implement in our simulation. The simulation is only an approximation and does not fully capture all details of the real-life situation of CBC. Therefore, this research cannot fully show the precise improvements made possible by the deep reinforcement learning method.

One of these simplifications concerns the demand. While we do have realized demand data available, we decided not to use it and to use a Poisson distribution instead. Because the actual demand turned out to be lumpy for various products, this assumption can have a large impact on the performance of the method.

Finally, our research covered several inventory systems and various benchmarks. However, for the general inventory system, we used the current method of CBC as a benchmark, while we know that this method is not optimal. Therefore, we do not know how well the deep reinforcement learning method performs in comparison with other, more sophisticated techniques.

7.2 Recommendations

This research has shown that deep reinforcement learning can be a suitable technique to solve various inventory problems. If ORTEC wants to further develop the method in order to use it for complex customer cases, we have the following recommendations:

• Do not use deep reinforcement learning as the main solution for customers yet. Rather, use the method on the side, to validate how it performs in comparison with other solution approaches. It is also recommended to implement various benchmarks in the simulation itself, to quickly see the performance of the deep reinforcement learning method in various settings.

• Before offering the solution to customers, first focus on the explainability of the method. Deep reinforcement learning is sometimes considered a ‘black box’ approach, because it can be unclear why the method chose a certain action. However, this does not have to be the case. We have tried to gain insight into the decisions that our method makes, and various studies have focused on the explainability of neural networks and deep reinforcement learning. We recommend ORTEC to also focus on the explainability of the method.

• Look for ways to reduce the complexity of the environment and the deep reinforcement learning method. Instead of deciding on the full order size, the deep reinforcement learning method could, for example, be used to determine the optimal base-stock parameters, as illustrated in the first sketch after this list. This would make the action space smaller and less dependent on the various states, which reduces the agent's training time and increases the possibility of finding feasible solutions for more complex environments.

• Keep a close eye on developments in the deep reinforcement learning field. The number of applications to operations research and inventory management is limited, but interest is growing quickly. Various deep reinforcement learning algorithms are publicly available, often implemented in several Python packages; the second sketch after this list shows how such packaged algorithms could be compared on the same environment. Investigating which algorithm works best for which type of problem can increase the performance of the deep reinforcement learning method. We recommend focusing on algorithms that allow a continuous action space, as these are considered more scalable and, therefore, suitable for general inventory problems.
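The first sketch below illustrates the base-stock alternative from the third recommendation: the agent chooses base-stock levels, and the order for each stock point is the shortfall of its inventory position, as in the 'BaseStock' order policy of the environment in Appendix A. The function name and argument layout are our own.

import numpy as np

def orders_from_base_stock(base_stock_levels, inventory, in_transit, backorders):
    """Translate base-stock levels chosen by the agent into order quantities."""
    position = np.asarray(inventory) + np.asarray(in_transit) - np.asarray(backorders)
    return np.maximum(0, np.asarray(base_stock_levels) - position)

The second sketch illustrates the fourth recommendation: comparing packaged algorithms with a continuous action space on the same environment. It uses the Stable Baselines3 interface (Raffin et al., 2019) and assumes the environment follows the standard Gym API; this packaged route is an alternative to the adapted PPO code used in this thesis, not a description of it.

from stable_baselines3 import A2C, PPO, SAC

def compare_algorithms(make_env, total_timesteps=100_000):
    """Train several continuous-action algorithms on the same environment."""
    trained = {}
    for name, algo in {'PPO': PPO, 'A2C': A2C, 'SAC': SAC}.items():
        model = algo('MlpPolicy', make_env(), verbose=0)
        model.learn(total_timesteps=total_timesteps)
        trained[name] = model
    return trained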

7.3 Future research

This thesis introduces a deep reinforcement learning method that performs well on various inventory systems. However, the usage of this method also raised new questions, which result in the following suggestions for further research.

We noticed that the state and action vectors have a large effect on the performance. Defining these vectors is not trivial and is case specific. Therefore, we recommend looking for ways to automate and optimize this process.

Next to that, we saw that the action space interval is currently not used correctly in several runs. For future research, we recommend experimenting with different intervals in order to improve the method and to reduce the variation in results between runs.

We have now applied the deep reinforcement learning method to a simplified version of the case of CBC. Therefore, our suggestion for future research is to further expand the simulation so that it reflects the real-life situation of CBC. This way, the method can also be used on other, more complex inventory systems. This is especially interesting for inventory systems that are hard to solve with more traditional methods, such as heuristics and mathematical models.

Next to that, it would be interesting to use the original demand data of CBC. This demand data is lumpy and, therefore, hard to predict. We are interested to see whether deep reinforcement learning is still able to determine order quantities when the demand gets harder to predict and the stochasticity of the environment increases.
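To make the difference concrete, the sketch below contrasts the Poisson demand currently generated by the simulation with a simple intermittent ('lumpy') generator. The occurrence probability and the geometric order sizes are our own illustrative choices, not CBC's actual demand process.

import numpy as np

rng = np.random.default_rng(0)

def poisson_demand(mean, periods):
    """Smooth demand, as currently generated in the simulation environment."""
    return rng.poisson(mean, size=periods)

def lumpy_demand(p_order, size_mean, periods):
    """Intermittent demand: an order occurs with probability p_order per period,
    and its size is geometrically distributed (illustrative choice)."""
    occurs = rng.random(periods) < p_order
    sizes = rng.geometric(1.0 / size_mean, size=periods)
    return occurs * sizes

smooth = poisson_demand(20, periods=52)
lumpy = lumpy_demand(p_order=0.2, size_mean=100, periods=52)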

We have implemented the Proximal Policy Optimization algorithm, as it is a current state-of-the-art algorithm that is less sensitive to hyperparameters and, therefore, easier to apply to multiple cases. However, it could also be interesting to look at other DRL algorithms, as they might be a better fit for specific cases. Research has been done on comparing these algorithms on various games, but not yet in an inventory setting. Therefore, we recommend implementing various algorithms in the inventory environment to compare their performance on this domain.


Bibliography Achiam, J. (2018). Spinning Up in Deep Reinforcement Learning. Retrieved from https://github.com/ openai/spinningup Arnold, J., Chapman, S., & Clive, L. (2008). Introduction to Materials Management. Pearson Prentice Hall. Arulkumaran, K., Deisenroth, M. P., Brundage, M., & Bharath, A. A. (2017). Deep reinforcement learn- ing: A brief survey. IEEE Signal Processing Magazine, 34 (6), 26–38. doi:10.1109/MSP.2017.2743240 Axsäter, S. (2015). Inventory Control. International Series in Operations Research & Management Science. Springer International Publishing. Bellman, R. (1957). Dynamic Programming. Princeton University Press. Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic Programming. Anthropological Field Studies. Athena Scientific. Retrieved from https://books.google.nl/books?id=WxCCQgAACAAJ Carlos, D. P. A., Jairo, R. M. T., & Aldo, F. A. (2008). Simulation-optimization using a reinforcement learning approach. Proceedings - Winter Simulation Conference, 1376–1383. doi:10.1109/WSC. 2008.4736213 Chaharsooghi, S. K., Heydari, J., & Zegordi, S. H. (2008). A reinforcement learning model for supply chain ordering management: An application to the beer game. Decision Support Systems, 45 (4), 949–959. doi:10.1016/j.dss.2008.03.007 Chaudhary, V., Kulshrestha, R., & Routroy, S. (2018). State-of-the-art literature review on inventory models for perishable products. doi:10.1108/JAMR-09-2017-0091 Chen, F., Feng, Y., & Simchi-Levi, D. (2002). Uniform distribution of inventory positions in two-echelon periodic review systems with batch-ordering policies and interdependent demands. European Jour- nal of Operational Research, 140 (3), 648–654. doi:10.1016/S0377-2217(01)00203-X Chen, F., & Song, J.-S. (2001). Optimal Policies for Multiechelon Inventory Problems with Markov- Modulated Demand. Operations Research, 49 (2), 226–234. doi:10.1287/opre.49.2.226.13528 Chen, W., & Yang, H. (2019). A heuristic based on quadratic approximation for dual sourcing problem with general lead times and supply capacity uncertainty. IISE Transactions, 51 (9), 943–956. doi:10. 1080/24725854.2018.1537532 Chopra, S., & Meindl, P. (2015). Supply Chain Management: Strategy, Planning, and Operation, Global Edition (6th). Pearson Education Limited. Çimen, M., & Kirkbride, C. (2013). Approximate dynamic programming algorithms for multidimensional inventory optimization problems. IFAC Proceedings Volumes (IFAC-PapersOnline), 46 (9), 2015– 2020. doi:10.3182/20130619-3-RU-3018.00441 Çimen, M., & Kirkbride, C. (2017). Approximate dynamic programming algorithms for multidimensional flexible production-inventory problems. International Journal of Production Research, 55 (7), 2034– 2050. doi:10.1080/00207543.2016.1264643 De Kok, A. G., Grob, C., Laumanns, M., Minner, S., Rambau, J., & Schade, K. (2018). A typology and literature review on stochastic multi-echelon inventory models. European Journal of Operational Research, 269 (3), 955–983. doi:10.1016/j.ejor.2018.02.047 Ganeshan, R. (1999). Managing supply chain inventories: A multiple retailer, one warehouse, multiple supplier model. International Journal of Production Economics, 59 (1), 341–354. doi:10.1016/S0925- 5273(98)00115-7


Gemmink, M. (2019). The adoption of reinforcement learning in the logistics industry : A case study at a large international retailer, 143. Geng, W., Qiu, M., & Zhao, X. (2010). An inventory system with single distributor and multiple retailers: Operating scenarios and performance comparison. International Journal of Production Economics, 128 (1), 434–444. doi:10.1016/j.ijpe.2010.08.002 Giannoccaro, I., & Pontrandolfo, P. (2002). Inventory management in supply chains: A reinforcement learning approach. International Journal of Production Economics, 78 (2), 153–161. doi:10.1016/ S0925-5273(00)00156-0 Gijsbrechts, J., Boute, R. N., Van Mieghem, J. A., & Zhang, D. (2019). Can Deep Reinforcement Learning Improve Inventory Management? Performance and Implementation of Dual Sourcing-Mode Prob- lems. SSRN Electronic Journal, (July). doi:10.2139/ssrn.3302881 Gumus, A. T., & Guneri, A. F. (2007). Multi-echelon inventory management in supply chains with uncertain demand and lead times: literature review from an operational research perspective. Pro- ceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, 221. doi:10.1243/09544054JEM889 Gumus, A. T., & Guneri, A. F. (2009). A multi-echelon inventory management framework for stochastic and fuzzy supply chains. Expert Systems with Applications, 36 (3), 5565–5575. doi:10.1016/j.eswa. 2008.06.082 Gumus, A. T., Guneri, A. F., & Ulengin, F. (2010). A new methodology for multi-echelon inventory management in stochastic and neuro-fuzzy environments. In International journal of production economics (Vol. 128, 1, pp. 248–260). doi:10.1016/j.ijpe.2010.06.019 Habib, N. (2019). Hands-On Q-Learning with Python: Practical Q-learning with OpenAI Gym, Keras, and TensorFlow. Packt Publishing. Retrieved from https://books.google.nl/books?id=xxiUDwAAQBAJ Hausman, W. H., & Erkip, N. K. (1994). Multi-Echelon vs. Single-Echelon Inventory Control Policies for Low-Demand Items. Management Science, 40 (5), 597–602. doi:10.1287/mnsc.40.5.597 Hayes, G. (2019). How to Install OpenAI Gym in a Windows Environment. Retrieved from https:// towardsdatascience.com/how-to-install-openai-gym-in-a-windows-environment-338969e24d30 Iida, T. (2001). The infinite horizon non-stationary stochastic multi-echelon inventory problem and near- myopic policies. European Journal of Operational Research, 134 (3), 525–539. doi:10.1016/S0377- 2217(00)00275-7 Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation, 6 (6), 1185–1201. doi:10.1162/neco.1994.6.6.1185 Jiang, C., & Sheng, Z. (2009). Case-based reinforcement learning for dynamic inventory control in a multi-agent supply-chain system. Expert Systems with Applications, 36 (3 PART 2), 6520–6526. doi:10.1016/j.eswa.2008.07.036 Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4 (9), 237–285. doi:10.1613/jair.301 Kalchschmidt, M., Zotteri, G., & Verganti, R. (2003). Inventory management in a multi-echelon spare parts supply chain. International Journal of Production Economics, 81 (82), 397–413. doi:10.1016/ S0925-5273(02)00284-0 Kenton, W. (2019). Overfitting. Retrieved from https://www.investopedia.com/terms/o/overfitting.asp


Kim, C. O., Jun, J., Baek, J.-G., Smith, R. L., & Kim, Y. D. (2005). Adaptive inventory control models for supply chain management. International Journal of Advanced Manufacturing Technology, 26 (9-10), 1184–1192. doi:10.1007/s00170-004-2069-8 Kim, C. O., Kwon, I. H., & Baek, J.-G. (2008). Asynchronous action-reward learning for nonstationary serial supply chain inventory control. Applied Intelligence, 28 (1), 1–16. doi:10.1007/s10489-007- 0038-2 Kim, C. O., Kwon, I. H., & Kwak, C. (2010). Multi-agent based distributed inventory control model. Expert Systems with Applications, 37 (7), 5186–5191. doi:10.1016/j.eswa.2009.12.073 Kimbrough, S. O., Wu, D. J., & Zhong, F. (2002). Computers play the beer game: Can artificial agents manage supply chains? Decision Support Systems, 33 (3), 323–333. doi:10.1016/S0167- 9236(02) 00019-2 Krzyk, K. (2018). Coding Deep Learning For Beginners - Types of Machine Learning. Retrieved from https://www.experfy.com/blog/coding-deep-learning-for-beginners-types-of-machine-learning Kunnumkal, S., & Topaloglu, H. (2011). Linear programming based decomposition methods for inventory distribution systems. European Journal of Operational Research, 211 (2), 282–297. doi:10.1016/j. ejor.2010.11.026 Kwak, C., Choi, J. S., Kim, C. O., & Kwon, I. H. (2009). Situation reactive approach to Vendor Managed Inventory problem. Expert Systems with Applications, 36 (5), 9039–9045. doi:10.1016/j.eswa.2008. 12.018 Kwon, I. H., Kim, C. O., Jun, J., & Lee, J. H. (2008). Case-based myopic reinforcement learning for satisfying target service level in supply chain. Expert Systems with Applications, 35 (1-2), 389–397. doi:10.1016/j.eswa.2007.07.002 Lambrecht, M., Luyten, R., & Vander Eecken, J. (1985). Protective inventories and bottlenecks in pro- duction systems. European Journal of Operational Research, 22 (3), 319–328. doi:10.1016/0377- 2217(85)90251-6 Lambrecht, M., Muchstadt, J., & Luyten, R. (1984). Protective stocks in multi-stage production systems. International Journal of Production Research, 22 (6), 1001–1025. doi:10.1080/00207548408942517 Li, J., Guo, P., & Zuo, Z. (2008). Inventory control model for mobile supply chain management. Proceed- ings - The 2008 International Conference on Embedded Software and Systems Symposia, ICESS Symposia, 459–463. doi:10.1109/ICESS.Symposia.2008.85 Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., . . . Wierstra, D. (2016). Continuous control with deep reinforcement learning. In 4th international conference on learning representa- tions, iclr 2016 - conference track proceedings. Retrieved from http://arxiv.org/abs/1509.02971 Lin, L.-J. (1991). Self-improvement Based On Reinforcement Learning, Planning and Teaching. Machine Learning Proceedings 1991, 321, 323–327. doi:10.1016/b978-1-55860-200-7.50067-2 Littman, M. L. (2015). Reinforcement learning improves behaviour from evaluative feedback, 5–11. doi:10. 1038/nature14540 Mes, M. R. K., & Rivera, A. P. (2017). Approximate Dynamic Programming by Practical Examples. In International series in operations research and management science (Vol. 248, February, pp. 63– 101). doi:10.1007/978-3-319-47766-4{\_}3 Minner, S., Diks, E. B., & De Kok, A. G. (2003). A two-echelon inventory system with supply lead time flexibility. IIE Transactions (Institute of Industrial Engineers), 35 (2), 117–129. doi:10.1080/ 07408170304383


Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., . . . Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. 33rd International Conference on Ma- chine Learning, ICML 2016, 4, 2850–2869. Retrieved from http://arxiv.org/abs/1602.01783 Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., . . . Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518 (7540), 529–533. doi:10.1038/ nature14236 Muccino, E. (2019). Scaling Reward Values for Improved Deep Reinforcement Learning. Retrieved from https : / / medium . com / mindboard / scaling - reward - values - for - improved - deep - reinforcement - learning-e9a89f89411d Nahmias, S., & Smith, S. A. (1993). Mathematical Models of Retailer Inventory Systems: A Review. In Perspectives in operations management (pp. 249–278). doi:10.1007/978-1-4615-3166-1{\_}14 Nahmias, S., & Smith, S. A. (1994). Optimizing Inventory Levels in a Two-Echelon Retailer System with Partial Lost Sales. Management Science, 40 (5), 582–596. doi:10.1287/mnsc.40.5.582 OpenAI. (2015). About OpenAI. Retrieved from https://openai.com/about/ Oroojlooyjadid, A., Nazari, M., Snyder, L., & Takáč, M. (2017). A Deep Q-Network for the Beer Game: A Deep Reinforcement Learning algorithm to Solve Inventory Optimization Problems. Europe, 79 (0), 21. Retrieved from http://arxiv.org/abs/1708.05924 Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., . . . Chintala, S. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. In H. Wallach, H. Larochelle, A. Beygelzimer, d\textquotesingle Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural in- formation processing systems 32 (pp. 8024–8035). Curran Associates, Inc. Retrieved from http: //papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning- library.pdf Powell, W. B. (2011). Approximate Dynamic Programming: Solving the Curses of Dimensionality. Wiley. Raffin, A., Hill, A., Ernestus, M., Gleave, A., Kanervisto, A., & Dormann, N. (2019). Stable Baselines. \url{https://github.com/DLR-RM/stable-baselines3}. GitHub. Rana, A. (2018). Introduction: Reinforcement Learning with OpenAI Gym. Retrieved from https:// towardsdatascience.com/reinforcement-learning-with-openai-d445c2c687d2 Rao, J. J., Ravulapati, K. K., & Das, T. K. (2003). A simulation-based approach to study stochastic inventory-planning games. International Journal of Systems Science, 34 (12-13), 717–730. doi:10. 1080/00207720310001640755 Rao, U., Scheller-Wolf, A., & Tayur, S. (2000). Development of a Rapid-Response Supply Chain at Caterpillar. Operations Research, 48 (2), 189–204. doi:10.1287/opre.48.2.189.12380 Rau, H., Wu, M. Y., & Wee, H. M. (2003). Integrated inventory model for deteriorating items under a multi-echelon supply chain environment. International Journal of Production Economics, 86 (2), 155–168. doi:10.1016/S0925-5273(03)00048-3 Ravulapati, K. K., Rao, J., & Das, T. K. (2004). A reinforcement learning approach to stochastic busi- ness games. IIE Transactions (Institute of Industrial Engineers), 36 (4), 373–385. doi:10 . 1080 / 07408170490278698 Robinson, S. (2004). Simulation: The Practice of Model Development and Use (2nd). Wiley. Rong, Y., Atan, Z., & Snyder, L. V. (2017). Heuristics for Base-Stock Levels in Multi-Echelon Distribution Networks. Production and Operations Management, 26 (9), 1760–1777. doi:10.1111/poms.12717


Russell, S., & Norvig, P. (2009). Artificial Intelligence: A Modern Approach (3rd). USA: Prentice Hall Press. Saitoh, F., & Utani, A. (2013). Coordinated rule acquisition of decision making on supply chain by exploitation-oriented reinforcement learning -beer game as an example-. Lecture Notes in Com- puter Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 8131 LNCS, 537–544. doi:10.1007/978-3-642-40728-4{\_}67 Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms, 1–12. Retrieved from http://arxiv.org/abs/1707.06347 Shang, K. H., & Song, J. S. (2003). Newsvendor bounds and heuristic for optimal policies in serial supply chains. Management Science, 49 (5), 618–638. doi:10.1287/mnsc.49.5.618.15147 Shervais, S., & Shannon, T. T. (2001). Improving theoretically-optimal and quasi-optimal inventory and transportation policies using adaptive critic based approximate dynamic programming. In Proceed- ings of the international joint conference on neural networks (Vol. 2, pp. 1008–1013). doi:10.1109/ IJCNN.2001.939498 Shervais, S., Shannon, T., & Lendaris, G. (2003). Intelligent supply chain management using adaptive critic learning. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Hu- mans, 33 (2), 235–244. doi:10.1109/TSMCA.2003.809214 Shin, J., & Lee, J. H. (2019). Multi-timescale, multi-period decision-making model development by com- bining reinforcement learning and mathematical programming. Computers and Chemical Engineer- ing, 121, 556–573. doi:10.1016/j.compchemeng.2018.11.020 Silver, E. A., Pyke, D. F., & Thomas, D. J. (2016). Inventory and Production Management in Supply Chains (4th editio). doi:10.1201/9781315374406 Snyder, L. V. (2018). Multi-echelon base-stock optimization with upstream stockout costs. Technical report, Lehigh University. Sui, Z., Gosavi, A., & Lin, L. (2010). A reinforcement learning approach for inventory replenishment in vendor-managed inventory systems with consignment inventory. EMJ - Engineering Management Journal, 22 (4), 44–53. doi:10.1080/10429247.2010.11431878 Sutton, R. S. (1984). Temporal Credit Assignment in Reinforcement Learning (Doctoral dissertation, University of Massachusetts, Amherst). Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learing - An Introduction. Szepesvári, C. (2010). Algorithms for reinforcement learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9, 1–89. doi:10.2200/S00268ED1V01Y201005AIM009 Topan, E., Eruguz, A. S., Ma, W., Van Der Heijden, M. C., & Dekker, R. (2020). A review of operational spare parts service logistics in service control towers. European Journal of Operational Research, 282 (2), 401–414. doi:10.1016/j.ejor.2019.03.026 Van Der Heijden, M. C., Diks, E. B., & De Kok, A. G. (1997). Stock allocation in general multi-echelon dis- tribution systems with (R, S) order-up-to-policies. International Journal of Production Economics, 49 (2), 157–174. doi:10.1016/S0925-5273(97)00005-4 Van Roy, B., Bertsekas, D. P., Lee, Y., & Tsitsiklis, J. N. (1997). A neuro-dynamic programming approach to retailer inventory management. Proceedings of the IEEE Conference on Decision and Control, 4 (December), 4052–4057. doi:10.1109/cdc.1997.652501 Van Santen, N. (2019). Efficiently Engineering Variant Simulation Models to Enable Performance Eval- uation of Different Inventory Systems.


Van Tongeren, T., Kaymak, U., Naso, D., & Van Asperen, E. (2007). Q-learning in a competitive supply chain. Conference Proceedings - IEEE International Conference on Systems, Man and Cybernetics, 1211–1216. doi:10.1109/ICSMC.2007.4414132 Vercraene, S., & Gayon, J. P. (2013). Optimal control of a production-inventory system with productre- turns. International Journal of Production Economics, 142 (2), 302–310. doi:10.1016/j.ijpe.2012.11. 012 Vermorel, J. (2015). Fill rate definition. Retrieved from https://www.lokad.com/fill-rate-definition Violante, A. (2019). Simple Reinforcement Learning: Q-learning. Retrieved from https://towardsdatascience. com/simple-reinforcement-learning-q-learning-fcddc4b6fe56 Watkins, C. J. C. H. (1989). Learning from Delayed Rewards (Doctoral dissertation, King’s College). Woerner, S., Laumanns, M., Zenklusen, R., & Fertis, A. (2015). Approximate dynamic programming for stochastic linear control problems on compact state spaces. European Journal of Operational Research, 241 (1), 85–98. doi:10.1016/j.ejor.2014.08.003 Xu, J., Zhang, J., & Liu, Y. (2009). An adaptive inventory control for a supply chain. 2009 Chinese Control and Decision Conference, CCDC 2009, 5714–5719. doi:10.1109/CCDC.2009.5195218 Yang, S., & Zhang, J. (2015). Adaptive inventory control and bullwhip effect analysis for supply chains with non-stationary demand. Proceedings of the 2015 27th Chinese Control and Decision Confer- ence, CCDC 2015, 3903–3908. doi:10.1109/CCDC.2015.7162605 Zarandi, M. H. F., Moosavi, S. V., & Zarinbal, M. (2013). A fuzzy reinforcement learning algorithm for inventory control in supply chains. International Journal of Advanced Manufacturing Technology, 65 (1-4), 557–569. doi:10.1007/s00170-012-4195-z Zhang, K., Xu, J., & Zhang, J. (2013). A new adaptive inventory control method for supply chains with non-stationary demand. 2013 25th Chinese Control and Decision Conference, CCDC 2013, 1034– 1038. doi:10.1109/CCDC.2013.6561076 Zychlinski, S. (2019). Qrash Course: Reinforcement Learning 101 & Deep Q Networks in 10 Minutes. Retrieved from https://towardsdatascience.com/qrash-course-deep-q-networks-from-the-ground- up-1bbda41d3677

The icons that are used in Figure 1.1, Figure 2.3, Figure 3.2, Figure 4.3 and Figure 4.4 are adapted versions from:

• Customer by Priyanka from The Noun Project
• Warehouse by RomanP from The Noun Project
• Truck by Anna Sophie from The Noun Project
• Factory by Iconsphere from The Noun Project


A. Simulation Environment

import random
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import gym
from gym import spaces

def generate_leadtime(t, dist, lowerbound, upperbound):
    """
    Generate the leadtime of the dataset from paper or distribution.

    Returns: Integer
    """
    if dist == 'uniform':
        leadtime = random.randrange(lowerbound, upperbound + 1)
    else:
        raise Exception
    return leadtime


class InventoryEnv(gym.Env):
    """
    General Inventory Control Environment.

    Currently tested with:
    - A reinforcement learning model for supply chain ordering management:
      An application to the beer game - Chaharsooghi (2008)
    """

    def __init__(self, case, action_low, action_high, action_min, action_max,
                 state_low, state_high, method,
                 coded=False, fix=True, ipfix=True):
        self.case = case
        self.case_name = case.__class__.__name__
        self.n = case.leadtime_ub + 1
        self.coded = coded
        self.fix = fix
        self.ipfix = ipfix
        self.method = method
        if self.method == 'DRL':
            self.action_low = action_low
            self.action_high = action_high
            self.action_min = action_min
            self.action_max = action_max
            self.state_low = state_low
            self.state_high = state_high
            self.determine_potential_actions()
            self.determine_potential_states()

    def determine_potential_actions(self):
        """
        Possible actions returned as Gym Space
        each period
        """
        self.feasible_actions = 0
        self.action_space = spaces.Box(self.action_low, self.action_high, dtype=np.int32)

    def determine_potential_states(self):
        """
        Based on the mean demand, we determine the maximum and minimum
        inventory to prevent the environment from reaching unlikely states
        """
        # Observation space consists of the current timestep and inventory positions of every echelon
        self.observation_space = spaces.Box(self.state_low, self.state_high, dtype=np.int32)

    def _generate_demand(self):
        """
        Generate the demand using a predefined distribution.

        Writes the demand to the orders table.
        """
        source, destination = np.nonzero(self.case.connections)
        for retailer, customer in zip(source[-self.case.no_customers:],
                                      destination[-self.case.no_customers:]):
            if self.case.demand_dist == 'poisson':
                demand_mean = random.randrange(self.case.demand_lb,
                                               self.case.demand_ub + 1)
                demand = np.random.poisson(demand_mean)
            elif self.case.demand_dist == 'uniform':
                demand = random.randrange(self.case.demand_lb,
                                          self.case.demand_ub + 1)
            self.O[0, customer, retailer] = demand

    def calculate_reward(self):
        """
        Calculate the reward for the current period.

        Returns: holding costs, backorder costs
        """
        backorder_costs = np.sum(self.BO[0] * self.case.bo_costs)
        holding_costs = np.sum(self.INV[0] * self.case.holding_costs)
        return holding_costs, backorder_costs

    def _initialize_state(self):
        """
        Initialize the inventory position for every node.

        Copies the inventory position from the previous timestep.
        """
        (...)

    def _receive_incoming_delivery(self):
        """
        Receives the incoming delivery for every stockpoint.

        Customers are not taken into account because of zero lead time
        Based on the amount stated in T
        """
        # Loop over all suppliers and stockpoints
        for i in range(0, self.case.no_stockpoints + self.case.no_suppliers):
            # Loop over all stockpoints
            # Note that only forward delivery is possible, hence 'i+1'
            for j in range(i + 1, self.case.no_stockpoints +
                           self.case.no_suppliers):
                delivery = self.T[0, i, j]
                self.INV[0, j] += delivery
                self.in_transit[0, i, j] -= delivery
                self.T[0, i, j] = 0

    def _receive_incoming_orders(self):
        # Loop over every stockpoint
        for i in range(self.case.no_stockpoints + self.case.no_suppliers):
            # Check if the inventory is larger than all incoming orders
            if self.INV[0, i] >= np.sum(self.O[0, :, i], 0):
                for j in np.nonzero(self.case.connections[i])[0]:
                    if self.O[0, j, i] > 0:
                        self._fulfill_order(i, j, self.O[0, j, i])
                        if self.t >= self.case.warmup:
                            self.TotalFulfilled[j, i] += self.O[0, j, i]
            else:
                IPlist = {}
                # Generate a list of stockpoints that have outstanding orders
                k_list = np.nonzero(self.O[0, :, i])[0]
                bo_echelon = np.sum(self.BO[0], 0)
                for k in k_list:
                    IPlist[k] = self.INV[0, k] - bo_echelon[k]
                # Check the lowest inventory position and sort these on lowest IP
                sorted_IP = {k: v for k, v in sorted(IPlist.items(), key=lambda item: item[1])}
                for j in sorted_IP:
                    inventory = self.INV[0, i]
                    # Check if the remaining order can be fulfilled completely
                    if inventory >= self.O[0, j, i]:
                        self._fulfill_order(i, j, self.O[0, j, i])
                        if self.t >= self.case.warmup:
                            self.TotalFulfilled[j, i] += self.O[0, j, i]
                    else:
                        # Else, fulfill how far possible
                        quantity = self.O[0, j, i] - inventory
                        self._fulfill_order(i, j, inventory)
                        if self.t >= self.case.warmup:
                            self.TotalFulfilled[j, i] += inventory
                        if self.case.unsatisfied_demand == 'backorders':
                            self.BO[0, j, i] += quantity
                            if self.t >= self.case.warmup:
                                self.TotalBO[j, i] += quantity
        if self.case.unsatisfied_demand == 'backorders':
            i_list, j_list = np.nonzero(self.case.connections)
            for i, j in zip(i_list, j_list):
                inventory = self.INV[0, i]
                # If there are any backorders, fulfill them afterwards
                if inventory > 0:
                    # If the inventory is larger than the backorder
                    # Fulfill the whole backorder
                    backorder = self.BO[0, j, i]
                    if inventory >= backorder:
                        if self.fix:
                            self._fulfill_order(i, j, backorder)
                        else:
                            self.INV[0, i] -= backorder
                        self.BO[0, j, i] = 0
                    # Else, fulfill the entire inventory
                    else:
                        self._fulfill_order(i, j, inventory)
                        self.BO[0, j, i] -= inventory

    def _recieve_incoming_orders_customers(self):
        i_list, j_list = np.nonzero(self.case.connections)
        for i, j in zip(i_list[-self.case.no_customers:], j_list[-self.case.no_customers:]):
            if self.O[0, j, i] > 0:
                # Check if the current order can be fulfilled
                if self.INV[0, i] >= self.O[0, j, i]:
                    self._fulfill_order(i, j, self.O[0, j, i])
                # Else, fulfill as far as possible
                else:
                    inventory = max(self.INV[0, i], 0)
                    quantity = self.O[0, j, i] - inventory
                    self._fulfill_order(i, j, inventory)
                    # Add to backorder if applicable
                    if self.case.unsatisfied_demand == 'backorders':
                        self.BO[0, j, i] += quantity
        if self.case.unsatisfied_demand == 'backorders':
            for i, j in zip(i_list[-self.case.no_customers:], j_list[-self.case.no_customers:]):
                inventory = self.INV[0, i]
                # If there are any backorders, fulfill them afterwards
                if inventory > 0:
                    # If the inventory is larger than the backorder
                    # Fulfill the whole backorder
                    backorder = self.BO[0, j, i]
                    if inventory >= backorder:
                        # I find this very illogical, but as far as I can see now
                        # the backorder never arrives in the IPs.
                        # Now handled by means of the fix
                        if self.fix:
                            self._fulfill_order(i, j, backorder)
                        else:
                            self.INV[0, i] -= backorder
                        self.BO[0, j, i] = 0
                    # Else, fulfill the entire inventory
                    else:
                        self._fulfill_order(i, j, inventory)
                        self.BO[0, j, i] -= inventory

    def _recieve_incoming_orders_divergent(self):
        # Ship from supplier to warehouse
        self._fulfill_order(0, 1, self.O[0, 1, 0])
        # Check if the warehouse can ship all orders
        if self.INV[0, 1] >= np.sum(self.O[0, :, 1], 0):
            i_list, j_list = np.nonzero(self.case.connections)
            for i, j in zip(i_list[self.case.no_suppliers:self.case.no_suppliers +
                                   self.case.no_stockpoints],
                            j_list[self.case.no_suppliers:self.case.no_suppliers +
                                   self.case.no_stockpoints]):
                if self.O[0, j, i] > 0:
                    self._fulfill_order(i, j, self.O[0, j, i])
        else:
            IPlist = {}
            i_list, _ = np.nonzero(self.O[0])
            bo_echelon = np.sum(self.BO[0], 0)
            for i in i_list:
                IPlist[i] = self.INV[0, i] - bo_echelon[i]
            # Check the lowest inventory position and sort these on lowest IP
            sorted_IP = {k: v for k, v in sorted(IPlist.items(), key=lambda item: item[1])}
            # Check if there is still inventory left
            if self.INV[0, 1] >= 0:
                for i in sorted_IP:
                    # Check if the remaining order can be fulfilled completely
                    if self.INV[0, 1] >= self.O[0, i, 1]:
                        self._fulfill_order(1, i, self.O[0, i, 1])
                    else:
                        # Else, fulfill how far possible
                        inventory = max(self.INV[0, 1], 0)

                        quantity = self.O[0, i, 1] - inventory
                        self._fulfill_order(1, i, inventory)
                        break

    def _fulfill_order(self, source, destination, quantity):
        # Customers don't have any lead time.
        if destination >= self.case.no_nodes - self.case.no_customers:
            leadtime = 0
        else:
            leadtime = self.leadtime
        # The order is fulfilled immediately for the customer
        # or whenever the leadtime is 0
        if leadtime == 0:
            # The new inventory level is increased with the shipped quantity
            self.INV[0, destination] += quantity
        else:
            # If the order is not fulfilled immediately, denote the time when
            # the order will be delivered. This can not be larger than the horizon
            if leadtime < self.n:
                self.T[leadtime, source, destination] += quantity
            else:
                raise NotImplementedError
            for k in range(0, min(leadtime, self.n) + 1):
                self.in_transit[k, source, destination] += quantity
        # Suppliers have unlimited capacity
        if source >= self.case.no_suppliers:
            self.INV[0, source] -= quantity

    def _place_outgoing_order(self, t, action):
        k = 0
        incomingOrders = np.sum(self.O[0], 0)
        # Loop over all suppliers and stockpoints
        for j in range(self.case.no_suppliers, self.case.no_stockpoints +
                       self.case.no_suppliers):
            RandomNumber = random.random()
            probability = 0
            for i in range(0, self.case.no_stockpoints + self.case.no_suppliers):
                if self.case.connections[i, j] == 1:
                    self._place_order(i, j, t, k, action, incomingOrders)
                    k += 1
                elif self.case.connections[i, j] > 0:
                    probability += self.case.connections[i, j]
                    if RandomNumber < probability:
                        self._place_order(i, j, t, k, action, incomingOrders)
                        k += 1
                        break

    def _place_order(self, i, j, t, k, action, incomingOrders):
        if self.case.order_policy == 'X':
            self.O[t, j, i] += action[k]
            if (self.t < self.case.horizon - 1) and (self.t >= self.case.warmup - 1):
                self.TotalDemand[j, i] += action[k]
        elif self.case.order_policy == 'X+Y':
            self.O[t, j, i] += incomingOrders[j] + action[k]
            if (self.t < self.case.horizon - 1) and (self.t >= self.case.warmup - 1):
                self.TotalDemand[j, i] += incomingOrders[j] + action[k]
        elif self.case.order_policy == 'BaseStock':
            bo_echelon = np.sum(self.BO[0], 0)
            self.O[t, j, i] += max(0, action[k] - (self.INV[0, j] + self.in_transit[0, i, j] - bo_echelon[j]))
            if (self.t < self.case.horizon - 1) and (self.t >= self.case.warmup - 1):
                self.TotalDemand[j, i] += max(0, action[k] - self.INV[0, j] + self.in_transit[0, i, j] - bo_echelon[j])
        else:
            raise NotImplementedError

    def _code_state(self):
        (...)
        return CIP

    def _check_action_space(self, action):
        # No explicit penalty is applied here; actions outside the action
        # space are simply clipped, so the penalty is kept at zero.
        penalty = 0
        if isinstance(self.action_space, spaces.Box):
            low = self.action_space.low
            high = self.action_space.high
            act_max = self.action_max
            act_min = self.action_min
            action_clip = np.clip(action, low, high)
            for i in range(len(action_clip)):
                action_clip[i] = ((action_clip[i] - low[i]) / (high[i] - low[i])) * (act_max[i] - act_min[i]) + act_min[i]
            action = [np.round(num) for num in action_clip]
        return action, penalty

    def step(self, action, visualize=False):
        """
        Execute one step in the RL method.

        input: actionlist, visualize
        """
        self.leadtime = generate_leadtime(0, self.case.leadtime_dist,
                                          self.case.leadtime_lb, self.case.leadtime_ub)
        action, penalty = self._check_action_space(action)
        self._initialize_state()
        if visualize: self._visualize("0. IP")
        if self.case_name == "BeerGame" or self.case_name == "General":
            self._generate_demand()
            self._receive_incoming_delivery()
            if visualize: self._visualize("1. Delivery")
            self._receive_incoming_orders()
            if visualize: self._visualize("2. Demand")
            self._place_outgoing_order(1, action)
        elif self.case_name == "Divergent":
            # According to the paper:
            # (1) Warehouse places order to external supplier
            self._place_outgoing_order(0, action)
            if visualize: self._visualize("1. Warehouse order")
            # (2) Warehouse ships the orders to retailers taking the inventory position into account
            self._recieve_incoming_orders_divergent()
            if visualize: self._visualize("2. Warehouse ships")
            # (3) Warehouse and retailers receive their orders
            self._receive_incoming_delivery()
            if visualize: self._visualize("3. Orders received")
            # (4) Demand from customers is observed
            self._generate_demand()
            self._recieve_incoming_orders_customers()
            if visualize: self._visualize("4. Demand")
        else:
            raise NotImplementedError
        CIP = self._code_state()
        holding_costs, backorder_costs = self.calculate_reward()
        reward = holding_costs + backorder_costs + penalty
        return CIP, -reward / self.case.divide, False

    def reset(self):
        (...)


B. Beer Game - Q-learning Code

1 """@author: KevinG.""" 2 import random 3 import time as ct 4 import numpy as np 5 from inventory_env import InventoryEnv 6 from cases.beergame import BeerGame 7

8 9 def encode_state(state): 10 """Encode the state, so we can find it in the q_table.""" 11 encoded_state= (state[0]-1)* 729 12 encoded_state += (state[1]-1)* 81 13 encoded_state += (state[2]-1)*9 14 encoded_state += (state[3]-1) 15 return int(encoded_state) 16 17 def decode_action(action): 18 """Decode the action, so we can use it in the environment.""" 19 decoded_action=[] 20 decoded_action.append(int(action/ 64)) 21 action= action% 64 22 decoded_action.append(int(action/ 16)) 23 action= action% 16 24 decoded_action.append(int(action/4)) 25 action= action%4 26 action= decoded_action.append(int(action)) 27 return decoded_action 28 29 def get_next_action_paper(time): 30 """ 31 Determine the next action to take based on the policy from the paper. 32 """ 33 actionlist=[106, 247, 35, 129, 33, 22, 148, 22, 231, 22, 22, 111, 111, 34 229, 192, 48, 87, 49,0, 187, 254, 236, 94, 94, 89, 25, 22, 35 250, 45, 148, 106, 243, 151, 123, 67] 36 return actionlist[time] 37 38 class QLearning: 39 """Based on the beer game by Chaharsooghi (2008).""" 40 41 def __init__(self, seed): 42 random.seed(seed) 43 np.random.seed(seed) 44 self.case= BeerGame() 45 self.dist= 'uniform' 46 # State-Action variables 47 possible_states=[1,2,3,4,5,6,7,8,9] 48 possible_actions=[0,1,2,3] 49 50 self.no_states= len(possible_states)** self.case.no_stockpoints 51 self.no_actions= len(possible_actions)** self.case.no_stockpoints 52 53 self.initialize_q_learning() 54 55 # Initialize environment 56 self.env= InventoryEnv(case=self.case, n=self.horizon, action_low=0, action_high=0, 57 action_min=0, action_max=0, state_low=0, state_high=0, 58 actions=self.no_actions, coded=True, fix=False, ipfix=False,

95 59 method='Q-learning') 60 61 def initialize_q_learning(self): 62 # Time variables 63 self.max_iteration= 1000000 # max number of iterations 64 self.alpha= 0.17 # Learning rate 65 self.horizon= 35 66 self.stepsize= 1000 67 68 # Exploration Variables 69 self.exploitation= 0.02 # Starting exploitation rate 70 exploitation_max= 0.90 71 exploitation_min= 0.02 72 self.exploitation_iter_max= 0.98 73 self.exploitation_delta= ((exploitation_max- 74 exploitation_min)/ 75 (self.max_iteration-1)) 76 # Initialize Q values 77 self.q_table= np.full([self.horizon+1, self.no_actions, self.no_states],-10000) 78

79 80 def get_next_action(self, time, state, exploitation): 81 """Determine the next action to be taken. 82 83 Based on the exploitation rate 84 Returns a list of actions 85 """ 86 if random.random() <= exploitation: 87 return self.greedy_action(time, state) 88 return self.random_action() 89 90 def greedy_action(self, time, state): 91 """Retrieve the best action for the current state. 92 93 Picks the best action corresponding to the highest Q value 94 Returns a list of actions 95 """ 96 state_e= encode_state(state) 97 action= self.q_table[time][:, state_e].argmax() 98 return action 99 100 def random_action(self): 101 """Generate a random set of actions.""" 102 action= random.randint(0, self.no_actions-1) 103 return action 104 105 def get_q(self, time, state_e): 106 """Retrieve highest Q value of the state in the next time period.""" 107 return self.q_table[time+1][:, state_e].max() 108 109 def update(self, time, old_state, new_state, action, reward): 110 """Update the Q table.""" 111 new_state_e= encode_state(new_state) 112 old_state_e= encode_state(old_state) 113 114 new_state_q_value= self.get_q(time, new_state_e) 115 old_state_q_value= self.q_table[time][action, old_state_e] 116 if new_state_q_value ==-10000: new_state_q_value=0 117 if old_state_q_value ==-10000: old_state_q_value=0 118 q_value= self.alpha* (reward+ new_state_q_value- old_state_q_value) 119 # q_value = self.alpha * (reward + new_state_q_value - self.q_table[time][0, old_state_e])

96 Appendix B. Beer Game - Q-learning Code

120 if self.q_table[time][action, old_state_e] ==-10000: 121 self.q_table[time][action, old_state_e]= q_value 122 else: 123 self.q_table[time][action, old_state_e] += q_value 124 # self.q_table[time][0, old_state_e] += q_value 125 126 def iteration(self): 127 """Iterate over the simulation.""" 128 current_iteration, time=0,0 129 while current_iteration< self.max_iteration: 130 exploitation_iter= self.exploitation 131 exploitation_iter_delta=((self.exploitation_iter_max- 132 self.exploitation)/(self.horizon-1)) 133 self.case.leadtime_dist, self.case.demand_dist= self.dist, self.dist 134 old_state= self.env.reset() 135 while time< self.horizon: 136 # action = get_next_action_paper(time) 137 action= self.get_next_action(time, old_state, exploitation_iter) 138 # Take action and calculate r(t+1) 139 action_d= decode_action(action) 140 new_state, reward, _, _= self.env.simulate(action_d) 141 self.update(time, old_state, new_state, action, reward) 142 old_state= new_state 143 exploitation_iter += exploitation_iter_delta 144 time +=1 145 self.exploitation += self.exploitation_delta 146 time=0 147 current_iteration +=1 148

149 150 STARTTIME= ct.time() 151 fork in range(0, 10): 152 ENV= QLearning(k) 153 print("Replication "+ str(k)) 154 QLearning.iteration(ENV)


C. Beer Game - Q-learning Experiment

To limit the action space, we only define one action: Y = 0. This way, we make sure the Q-table has a small size, which makes it easier for the Q-learning algorithm to learn the correct values.
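This experiment only requires a small change to the Q-learning code of Appendix B. The sketch below illustrates the idea, assuming the X+Y order policy of Chaharsooghi et al. (2008) with Y fixed to 0 for every stockpoint; the exact way the restriction was implemented may differ slightly from this sketch.

# Sketch: restrict the agent of Appendix B to the single action Y = 0 per stockpoint.
possible_actions = [0]                   # instead of [0, 1, 2, 3]
no_actions = len(possible_actions) ** 4  # four stockpoints, so 1 action in total

def decode_action(action):
    """With a single action, every stockpoint orders Y = 0 on top of the incoming order X."""
    return [0, 0, 0, 0]

With a single action, the greedy and random action selection always return action 0, so only one Q-value per time step and state has to be learned.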

In Figure C.1, we can see the highest Q-value of the first state. The orange line represents the total costs of the simulation.

[Figure C.1 is a line chart: the x-axis shows the iteration (0 to 2,000) and the y-axis the Q-value (scale x10^4, from 0 down to -1.2), with one line for the Q-value and one for the costs.]

Figure C.1: Highest Q-value of the first state per iteration.


D. Beer Game - Q-Values

-4282.71 -4281.67 -4279.46 -4280.64 -4279.64 -4280.29 -4280.55 -4280.94 -4281.56 -4278.61 -4281.38 -4281.00 -4283.40 -4281.52 -4281.36 -4280.66 -4282.02 -4281.86 -4281.83 -4282.93 -4281.25 -4281.17 -4280.10 -4279.46 -4282.77 -4281.47 -4279.39 -4281.58 -4280.45 -4281.51 -4280.20 -4278.97 -4278.44 -4280.07 -4280.23 -4280.40 -4279.73 -4281.79 -4279.79 -4280.95 -4281.98 -4280.59 -4278.78 -4280.62 -4279.29 -4278.73 -4281.00 -4281.12 -4276.68 -4280.70 -4279.99 -4278.45 -4281.33 -4282.68 -4282.33 -4281.80 -4280.58 -4279.46 -4281.37 -4280.77 -4281.82 -4281.46 -4281.08 -4279.16 -4282.41 -4282.17 -4277.82 -4283.78 -4280.64 -4281.88 -4282.81 -4283.84 -4277.89 -4278.43 -4282.82 -4281.03 -4277.88 -4280.13 -4280.30 -4277.74 -4279.19 -4279.30 -4280.98 -4278.93 -4278.18 -4279.71 -4281.70 -4279.93 -4283.53 -4280.73 -4275.43 -4281.32 -4282.57 -4281.48 -4279.19 -4280.78 -4281.40 -4280.60 -4278.24 -4280.72 -4281.89 -4279.74 -4280.14 -4282.57 -4283.38 -4278.64 -4281.19 -4279.18 -4280.96 -4281.41 -4279.31 -4282.19 -4279.31 -4280.14 -4281.58 -4279.56 -4275.69 -4279.18 -4276.17 -4279.69 -4282.56 -4281.12 -4279.76 -4279.44 -4280.21 -4282.06 -4279.65 -4281.71 -4279.52 -4281.18 -4279.76 -4280.95 -4277.45 -4277.43 -4282.13 -4279.39 -4278.08 -4280.56 -4278.25 -4281.57 -4279.77 -4280.03 -4281.66 -4280.40 -4282.13 -4281.42 -4283.61 -4281.19 -4279.39 -4278.88 -4281.01 -4280.67 -4279.09 -4283.33 -4281.72 -4283.73 -4281.03 -4281.04 -4280.24 -4279.56 -4278.02 -4281.01 -4279.47 -4280.69 -4281.69 -4278.85 -4280.57 -4280.38 -4284.04 -4282.18 -4281.98 -4281.03 -4284.11 -4281.41 -4276.85 -4279.35 -4282.74 -4280.43 -4280.23 -4276.90 -4278.11 -4278.50 -4280.50 -4279.83 -4281.84 -4277.83 -4281.85 -4281.56 -4282.90 -4278.30 -4278.82 -4284.42 -4282.31 -4282.99 -4282.00 -4280.70 -4281.42 -4282.18 -4283.85 -4282.25 -4274.26 -4279.44 -4281.89 -4282.19 -4280.16 -4279.75 -4283.44 -4285.21 -4277.25 -4280.70 -4279.13 -4281.25 -4282.52 -4281.52 -4279.17 -4283.03 -4279.43 -4282.92 -4282.23 -4279.29 -4274.61 -4282.25 -4282.39 -4282.72 -4281.19 -4281.89 -4282.44 -4278.73 -4279.50 -4280.04 -4280.34 -4279.75 -4282.32 -4281.60 -4279.52 -4279.60 -4278.30 -4282.09 -4281.86 -4281.81 -4277.59 -4281.87 -4279.09 -4282.18 -4283.99 -4278.41 -4278.31 -4280.36 -4280.87 -4279.14 -4282.76 -4280.33 -4278.76 -4280.36 -4285.25 -4281.72

Table D.1: Q-values of the first state (10 replications).


E. Beer Game - DRL Code

class BeerGame:
    """Based on the beer game by Chaharsooghi et al (2008)."""

    def __init__(self):
        # Supply chain variables
        # Number of nodes per echelon, including suppliers and customers
        # The first element is the number of suppliers
        # The last element is the number of customers
        self.stockpoints_echelon = [1, 1, 1, 1, 1, 1]
        # Number of suppliers
        self.no_suppliers = self.stockpoints_echelon[0]
        # Number of customers
        self.no_customers = self.stockpoints_echelon[-1]
        # Number of stockpoints
        self.no_stockpoints = sum(self.stockpoints_echelon) - \
            self.no_suppliers - self.no_customers

        # Total number of nodes
        self.no_nodes = sum(self.stockpoints_echelon)
        # Total number of echelons, including supplier and customer
        self.no_echelons = len(self.stockpoints_echelon)

        # Connections between every stockpoint
        self.connections = np.array([
            [0, 1, 0, 0, 0, 0],
            [0, 0, 1, 0, 0, 0],
            [0, 0, 0, 1, 0, 0],
            [0, 0, 0, 0, 1, 0],
            [0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0]
        ])
        # Determines what happens with unsatisfied demand, can be either 'backorders' or 'lost_sales'
        self.unsatisfied_demand = 'backorders'
        # Initial inventory per stockpoint
        self.initial_inventory = [100000, 12, 12, 12, 12, 0]
        # Holding costs per stockpoint
        self.holding_costs = [0, 1, 1, 1, 1, 0]
        # Backorder costs per stockpoint
        self.bo_costs = [2, 2, 2, 2, 2, 2]
        # Demand distribution, can be either 'poisson' or 'uniform'
        self.demand_dist = 'uniform'
        # Lower bound of the demand distribution
        self.demand_lb = 0
        # Upper bound of the demand distribution
        self.demand_ub = 15
        # Leadtime distribution, can only be 'uniform'
        self.leadtime_dist = 'uniform'
        # Lower bound of the leadtime distribution
        self.leadtime_lb = 0
        # Upper bound of the leadtime distribution
        self.leadtime_ub = 4
        # Predetermined order policy, can be either 'X' or 'X+Y'
        self.order_policy = 'X'
        self.horizon = 35
        self.divide = False
        self.warmup = 1
        self.fix = False
        self.action_low = np.array([-1, -1, -1, -1])
        self.action_high = np.array([1, 1, 1, 1])
        self.action_min = np.array([0, 0, 0, 0])
        self.action_max = np.array([30, 30, 30, 30])
        self.state_low = np.zeros([50])
        self.state_high = np.array([4000, 4000,
                                    1000, 1000, 1000, 1000,
                                    1000, 1000, 1000, 1000,
                                    30, 30, 30, 30,
                                    150, 150, 150, 150,
                                    150, 150, 150, 150,
                                    150, 150, 150, 150,
                                    150, 150, 150, 150,
                                    150, 150, 150, 150,
                                    30, 30, 30, 30,
                                    30, 30, 30, 30,
                                    30, 30, 30, 30,
                                    30, 30, 30, 30])


# PPO Settings
# activation function of network
network_activation = 'tanh'
# size of network
network_size = (64, 64)
# initial values of bias in network
network_bias_init = 0.0
# method of weight initialization for network (uniform or normal)
network_weights_init = 'uniform'
# number of iterations between evaluation
ppo_evaluation_steps = 100
# number of consecutive evaluation iterations without improvement
ppo_evaluation_threshold = 250
# maximum number of iterations in learning run
ppo_iterations = 25000
# length of one episode in buffer
ppo_buffer_length = 256
# discount factor used in GAE calculations
ppo_gamma = 0.99
# lambda rate used in GAE calculations
ppo_lambda = 0.95
# indicator of using a cooldown period in the buffer (boolean)
cooldown_buffer = False
# clipping value used in policy loss calculations
ppo_epsilon = 0.2
# learning rate for policy network
pi_lr = 1e-4
# learning rate for value network
vf_lr = 1e-4
# after x iterations, save model weights and histograms to tensorboard
ppo_save_freq = 100
# nr of epochs (i.e. repetitions of the buffer) used in updating the model weights
ppo_epochs = 10
# batch size used to split the buffer for updating the model weights
ppo_batch_size = 64
# number of simulation runs to compute benchmark and as stopping criterion
ppo_simulation_runs = 1
# length of simulation to compute benchmark and as stopping criterion
ppo_simulation_length = 35
# length of initial simulation that is discarded
ppo_warmup_period = 0

policy_results_states = [[0, 12, 12, 12, 12]]

for k in range(10):
    print("Replication " + str(k))
    # Initialize environment
    env = InventoryEnv(case, action_low, action_high, action_min, action_max,
                       state_low, state_high, FIX, 'DRL')
    run_name = "RN{}".format(k)

    # set random seed
    set_seeds(env, k)

    # call learning function
    ppo_learning(env, False, experiment_name, run_name,
                 network_activation, network_size, network_bias_init, network_weights_init,
                 ppo_evaluation_steps, ppo_evaluation_threshold,
                 ppo_iterations, ppo_buffer_length, ppo_gamma, ppo_lambda, cooldown_buffer,
                 ppo_epsilon, pi_lr, vf_lr, ppo_save_freq, ppo_epochs, ppo_batch_size,
                 ppo_simulation_runs, ppo_simulation_length, ppo_warmup_period, policy_results_states)


F. Divergent - DRL Code

class Divergent:
    """Based on the paper of Kunnumkal and Topaloglu (2011)."""

    def __init__(self):
        # Supply chain variables
        # Number of nodes per echelon, including suppliers and customers
        # The first element is the number of suppliers
        # The last element is the number of customers
        self.stockpoints_echelon = [1, 1, 3, 3]
        # Number of suppliers
        self.no_suppliers = self.stockpoints_echelon[0]
        # Number of customers
        self.no_customers = self.stockpoints_echelon[-1]
        # Number of stockpoints
        self.no_stockpoints = sum(self.stockpoints_echelon) - \
            self.no_suppliers - self.no_customers

        # Total number of nodes
        self.no_nodes = sum(self.stockpoints_echelon)
        # Total number of echelons, including supplier and customer
        self.no_echelons = len(self.stockpoints_echelon)

        # Connections between every stockpoint
        self.connections = np.array([
            [0, 1, 0, 0, 0, 0, 0, 0],
            [0, 0, 1, 1, 1, 0, 0, 0],
            [0, 0, 0, 0, 0, 1, 0, 0],
            [0, 0, 0, 0, 0, 0, 1, 0],
            [0, 0, 0, 0, 0, 0, 0, 1],
            [0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0, 0]
        ])
        # Determines what happens with unsatisfied demand, can be either 'backorders' or 'lost_sales'
        self.unsatisfied_demand = 'backorders'
        # Initial inventory per stockpoint
        self.initial_inventory = [1000000, 0, 0, 0, 0, 0, 0, 0]
        # Holding costs per stockpoint
        self.holding_costs = [0, 0.6, 1, 1, 1, 0, 0, 0]
        # Backorder costs per stockpoint
        self.bo_costs = [0, 0, 19, 19, 19, 0, 0, 0]
        # Demand distribution, can be either 'poisson' or 'uniform'
        self.demand_dist = 'poisson'
        # Lower bound of the demand distribution
        self.demand_lb = 5
        # Upper bound of the demand distribution
        self.demand_ub = 15
        # Leadtime distribution, can only be 'uniform'
        self.leadtime_dist = 'uniform'
        # Lower bound of the leadtime distribution
        self.leadtime_lb = 1
        # Upper bound of the leadtime distribution
        self.leadtime_ub = 1
        # Predetermined order policy, can be either 'X', 'X+Y' or 'BaseStock'
        self.order_policy = 'X'
        self.action_low = np.array([-1, -1, -1, -1])
        self.action_high = np.array([1, 1, 1, 1])
        self.action_min = np.array([0, 0, 0, 0])
        self.action_max = np.array([300, 75, 75, 75])
        self.state_low = np.zeros([13])
        self.state_high = np.array([1000, 450,            # Total inventory and backorders
                                    250, 250, 250, 250,   # Inventory per stockpoint
                                    150, 150, 150,        # Backorders per stockpoint
                                    150, 150, 150, 150])  # In transit per stockpoint
        self.horizon = 75
        self.warmup = 25
        self.divide = 1000


# PPO Settings
# activation function of network
network_activation = 'tanh'
# size of network
network_size = (64, 64)
# initial values of bias in network
network_bias_init = 0.0
# method of weight initialization for network (uniform or normal)
network_weights_init = 'uniform'
# number of iterations between evaluation
ppo_evaluation_steps = 100
# number of consecutive evaluation iterations without improvement
ppo_evaluation_threshold = 250
# maximum number of iterations in learning run
ppo_iterations = 30000
# length of one episode in buffer
ppo_buffer_length = 256
# discount factor used in GAE calculations
ppo_gamma = 0.99
# lambda rate used in GAE calculations
ppo_lambda = 0.95
# indicator of using a cooldown period in the buffer (boolean)
cooldown_buffer = False
# clipping value used in policy loss calculations
ppo_epsilon = 0.2
# learning rate for policy network
pi_lr = 1e-4
# learning rate for value network
vf_lr = 1e-4
# after x iterations, save model weights and histograms to tensorboard
ppo_save_freq = 500
# nr of epochs (i.e. repetitions of the buffer) used in updating the model weights
ppo_epochs = 10
# batch size used to split the buffer for updating the model weights
ppo_batch_size = 64
# number of simulation runs to compute benchmark and as stopping criterion
ppo_simulation_runs = 100
# length of simulation to compute benchmark and as stopping criterion
ppo_simulation_length = 75
# length of initial simulation that is discarded
ppo_warmup_period = 25

policy_results_states = [[0, 12, 12, 12, 12]]

for k in range(10):
    print("Replication " + str(k))
    # Initialize environment
    env = InventoryEnv(case, case.action_low, case.action_high,
                       case.action_min, case.action_max, case.state_low, case.state_high,
                       'DRL', fix=True)
    run_name = "RN{}".format(k)

    # set random seed
    set_seeds(env, k)

    # call learning function
    ppo_learning(env, False, experiment_name, run_name,
                 network_activation, network_size, network_bias_init, network_weights_init,
                 ppo_evaluation_steps, ppo_evaluation_threshold,
                 ppo_iterations, ppo_buffer_length, ppo_gamma, ppo_lambda, cooldown_buffer,
                 ppo_epsilon, pi_lr, vf_lr, ppo_save_freq, ppo_epochs, ppo_batch_size,
                 ppo_simulation_runs, ppo_simulation_length, ppo_warmup_period, policy_results_states)


G. Divergent - Pseudo-code

Algorithm 8: Event 1. Previous orders are received from the upstream stock point

1  for i = 0 to all stock points do
2      for j = 0 to all stock points and supplier do
3          add the shipment of current time step t from j to i to the inventory of i
4          remove the shipment from the in_transit table
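As an illustration, Event 1 can be written as a few lines of Python that operate directly on the INV, T and in_transit arrays of the simulation environment in Appendix A. This is only a simplified sketch; the actual method in our environment handles some additional bookkeeping and may differ.

def receive_incoming_delivery(INV, T, in_transit, no_suppliers, no_stockpoints):
    """Event 1 (Algorithm 8): add the shipments that arrive in the current period
    to the receiving stock points and clear them from the in-transit administration.

    INV[0, i]            on-hand inventory of node i
    T[0, j, i]           shipment from j to i that arrives this period
    in_transit[0, j, i]  pipeline inventory from j to i
    """
    for i in range(no_suppliers, no_suppliers + no_stockpoints):
        for j in range(no_suppliers + no_stockpoints):
            shipment = T[0, j, i]
            if shipment > 0:
                INV[0, i] += shipment            # add the shipment to the inventory of i
                in_transit[0, j, i] -= shipment  # remove it from the in-transit table
                T[0, j, i] = 0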

Algorithm 9: Event 2. Orders are received from downstream stock point

1   generate the random demand for the customers from the Poisson distribution with λ ∈ [5, 15]
2   for i = 0 to warehouse and stock points do
3       for j = 0 to stock points and customers do
4           if inventory of i ≥ order from j to i then
5               fulfill the complete order                 /* See Algorithm 6 */
6           else
7               fulfill the on-hand inventory if possible  /* See Algorithm 6 */
8               add the remaining part of the order to the backlog
9   for i = 0 to warehouse and stock points do
10      for j = 0 to stock points and customers do
11          if there are backorders for j and inventory of i > 0 then
12              if the inventory is larger than the backorder then
13                  fulfill the whole backorder            /* See Algorithm 6 */
14                  empty the backlog
15              else
16                  fulfill the on-hand inventory          /* See Algorithm 6 */
17                  remove the fulfilled amount from the backlog


H. Divergent - Heuristic

In this appendix, we elaborate on the calculations for the heuristic of Rong et al. (2017) to determine the base-stock levels of the divergent inventory system. The formulas of Rong et al. (2017) are extended with our own calculations.

Under the DA heuristic, we decompose the OWMR system into $N$ two-location serial systems, solve the base-stock levels of the locations in each serial system, and aggregate the solutions utilizing a procedure we call "backorder matching." We use $s^a_i$ to denote the local base-stock level at location $i$ based on the DA heuristic. Next, we provide a detailed description of the steps of the heuristic.

Step 1

Decompose the system into $N$ serial systems. Serial system $i$ consists of the warehouse and retailer $i$. We use $0_i$ to refer to the warehouse in serial system $i$. Utilizing the procedure in Shang and Song (2003), we approximate the echelon base-stock levels of retailer $i$, $S^{SS}_i$, and of the warehouse in serial system $i$, $S^{SS}_{0_i}$, as follows:

$$S^{SS}_i = F^{-1}_{D_i}\left(\frac{b_i + h_0}{b_i + h_i}\right) = F^{-1}_{D_i}(0.98) = 30,$$
$$S^{SS}_{0_i} = \frac{1}{2}\left[G^{-1}_{\tilde{D}_i}\left(\frac{b_i}{b_i + h_i}\right) + G^{-1}_{\tilde{D}_i}\left(\frac{b_i}{b_i + h_0}\right)\right] = \frac{1}{2}\left[G^{-1}_{\tilde{D}_i}(0.95) + G^{-1}_{\tilde{D}_i}(0.97)\right] = 52 \tag{H.1}$$

Here, $\tilde{D}_i$ is the total leadtime demand in serial system $i$; it is a Poisson random variable with rate $\lambda_i (L_0 + L_i)$. $F^{-1}$ is the inverse Poisson cumulative distribution function (CDF). For the warehouse, instead of using $F^{-1}$, we opt to use $G^{-1}_D$, the inverse function of an approximate Poisson CDF. Let $\lceil x \rceil$ and $\lfloor x \rfloor$ be the smallest integer no less than $x$ and the largest integer no greater than $x$, respectively. We define this approximate CDF of a Poisson random variable $D$ to be the following continuous, piecewise linear function:

$$G_D(x) = \begin{cases} 2F_D(0)\,x, & \text{if } x \le 0.5 \\ F_D(\lfloor x - 0.5 \rfloor) + \left[F_D(\lceil x - 0.5 \rceil) - F_D(\lfloor x - 0.5 \rfloor)\right]\left(x - 0.5 - \lfloor x - 0.5 \rfloor\right), & \text{if } x \ge 0.5 \end{cases} \tag{H.2}$$
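To make this approximation concrete, $G_D$ and its inverse can be evaluated numerically. The snippet below is an illustrative sketch using scipy.stats.poisson; the demand rate lam is a placeholder that should be set to the leadtime demand rate of the serial system, and the grid-based inverse is only one possible implementation.

import math
from scipy.stats import poisson

def G(x, lam):
    """Approximate Poisson CDF G_D(x) of Eq. (H.2) for D ~ Poisson(lam)."""
    F = lambda k: poisson.cdf(k, lam)
    if x <= 0.5:
        return 2 * F(0) * x
    lo = math.floor(x - 0.5)
    hi = math.ceil(x - 0.5)
    # Linear interpolation between F(lo) and F(hi)
    return F(lo) + (F(hi) - F(lo)) * (x - 0.5 - lo)

def G_inverse(p, lam, step=0.01):
    """Smallest grid point x with G_D(x) >= p, used for the fractiles in Eq. (H.1)."""
    x = 0.0
    while G(x, lam) < p:
        x += step
    return x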

Step 2

Calculate the local base-stock level for the warehouse in serial system $i$ by $s^d_{0_i} = S^{SS}_{0_i} - S^{SS}_i$. For the retailers, we have $s^d_i = S^{SS}_i$, $\forall i \in \{1, 2, \ldots, N\}$. We approximate the expected backorders of the warehouse in serial system $i$ by

$$E\left[B_{0_i}\right] = E\left[\left(D_{0_i} - s^d_{0_i}\right)^{+}\right] = Q_{D_{0_i}}\left(s^d_{0_i}\right) = 0.97949 \tag{H.3}$$

where $D_{0_i}$ is a Poisson random variable with rate $\lambda_i L_0$, and $Q_D(x)$ is the loss function of the Poisson random variable $D$.

Step 3

Aggregate the serial systems back into the OWMR system utilizing a "backorder matching" procedure. We approximate the total expected backorders at the warehouse by $E[B_0] \approx \sum_{i=1}^{N} E[B_{0_i}]$. Specifically, the backorder matching procedure sets $s^a_0$, the base-stock level at the warehouse, equal to the smallest integer $s_0$ such that:

$$E\left[\left(D_0 - s_0\right)^{+}\right] \le \sum_{i=1}^{N} E\left[B_{0_i}\right] = 2.9384 \tag{H.4}$$

In other words,

$$s^a_0 = Q^{-1}_{D_0}\left(\sum_{i=1}^{N} Q_{D_{0_i}}\left(s^d_{0_i}\right)\right) = 124 \tag{H.5}$$

where $Q^{-1}_D(y)$ is defined as $\min\{s \in \mathbb{Z} : Q_D(s) \le y\}$. Similar to their Theorem 1 for the RO heuristic, Rong et al. (2017) prove the asymptotic optimality of DA for OWMR systems as $h_0$ goes to 0.

This gives us the base-stock parameters: [124, 30, 30, 30].
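The aggregation of Steps 2 and 3 can also be scripted. The sketch below illustrates the Poisson loss function and the backorder matching of (H.3)-(H.5); it is not the exact script we used, and the base-stock levels $s^d_{0_i}$ and the Poisson rates of $D_{0_i}$ and $D_0$ are inputs that have to be taken from Steps 1 and 2 and the case parameters.

from scipy.stats import poisson

def poisson_loss(s, mu):
    """Loss function Q_D(s) = E[(D - s)^+] for D ~ Poisson(mu)."""
    # Identity: E[(D - s)^+] = mu * P(D >= s) - s * P(D >= s + 1)
    return mu * poisson.sf(s - 1, mu) - s * poisson.sf(s, mu)

def inverse_poisson_loss(y, mu):
    """Q_D^{-1}(y) = min{s >= 0 : Q_D(s) <= y}, as used in Eq. (H.5)."""
    s = 0
    while poisson_loss(s, mu) > y:
        s += 1
    return s

def backorder_matching(s_d_warehouse, mu_serial, mu_warehouse):
    """Step 3: aggregate the serial systems into one warehouse base-stock level.

    s_d_warehouse: local warehouse base-stock level s^d_{0_i} per serial system (Step 2)
    mu_serial:     Poisson rate of D_{0_i} per serial system
    mu_warehouse:  Poisson rate of the total warehouse leadtime demand D_0
    """
    # Eq. (H.3)/(H.4): sum the expected backorders of the serial systems
    total_backorders = sum(poisson_loss(s, mu)
                           for s, mu in zip(s_d_warehouse, mu_serial))
    # Eq. (H.5): smallest s_0 whose expected backorders do not exceed that sum
    return inverse_poisson_loss(total_backorders, mu_warehouse)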


I. CBC - DRL Code

class General:
    """Based on the case of the CardBoard Company."""

    def __init__(self):
        # Supply chain variables
        # Number of nodes per echelon, including suppliers and customers
        # The first element is the number of suppliers
        # The last element is the number of customers
        self.stockpoints_echelon = [4, 4, 5, 5]
        # Number of suppliers
        self.no_suppliers = self.stockpoints_echelon[0]
        # Number of customers
        self.no_customers = self.stockpoints_echelon[-1]
        # Number of stockpoints
        self.no_stockpoints = sum(self.stockpoints_echelon) - \
            self.no_suppliers - self.no_customers

        # Total number of nodes
        self.no_nodes = sum(self.stockpoints_echelon)
        # Total number of echelons, including supplier and customer
        self.no_echelons = len(self.stockpoints_echelon)
        # Connections between every stockpoint
        self.connections = np.array([
            # 0  1  2  3  4  5  6  7  8    9    10    11   12   13 14 15 16 17
            [0, 0, 0, 0, 1, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0],  # 0
            [0, 0, 0, 0, 0, 1, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0],  # 1
            [0, 0, 0, 0, 0, 0, 1, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0],  # 2
            [0, 0, 0, 0, 0, 0, 0, 1, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0],  # 3
            [0, 0, 0, 0, 0, 0, 0, 0, 0.6, 0.5, 0.15, 0,   0,   0, 0, 0, 0, 0],  # 4
            [0, 0, 0, 0, 0, 0, 0, 0, 0.3, 0.4, 0.80, 0.1, 0,   0, 0, 0, 0, 0],  # 5
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0.8, 0.7, 0, 0, 0, 0, 0],  # 6
            [0, 0, 0, 0, 0, 0, 0, 0, 0.1, 0.1, 0.05, 0.1, 0.3, 0, 0, 0, 0, 0],  # 7
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   1, 0, 0, 0, 0],  # 8
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 1, 0, 0, 0],  # 9
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 1, 0, 0],  # 10
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 1, 0],  # 11
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 1],  # 12
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0],  # 13
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0],  # 14
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0],  # 15
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0],  # 16
            [0, 0, 0, 0, 0, 0, 0, 0, 0,   0,   0,    0,   0,   0, 0, 0, 0, 0]   # 17
        ])
        # Determines what happens with unsatisfied demand, can be either 'backorders' or 'lost_sales'
        self.unsatisfied_demand = 'backorders'
        # Holding costs per stockpoint
        self.holding_costs = [0, 0, 0, 0, 0.6, 0.6, 0.6, 0.6, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
        # Backorder costs per stockpoint
        self.bo_costs = [0, 0, 0, 0, 0, 0, 0, 0, 19, 19, 19, 19, 19, 0, 0, 0, 0, 0]
        # Demand distribution, can be either 'poisson' or 'uniform'
        self.demand_dist = 'poisson'
        # Lower bound of the demand distribution
        self.demand_lb = 15
        # Upper bound of the demand distribution
        self.demand_ub = 15
        # Leadtime distribution, can only be 'uniform'
        self.leadtime_dist = 'uniform'
        # Lower bound of the leadtime distribution
        self.leadtime_lb = 1
        # Upper bound of the leadtime distribution
        self.leadtime_ub = 1
        # Predetermined order policy, can be either 'X', 'X+Y' or 'BaseStock'
        self.order_policy = 'X'
        self.horizon = 75
        self.warmup = 50
        self.divide = 1000
        self.action_low = np.array([-5, -5, -5, -5, -5, -5, -5, -5, -5])
        self.action_high = np.array([5, 5, 5, 5, 5, 5, 5, 5, 5])
        self.action_min = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0])
        self.action_max = np.array([300, 300, 300, 300, 75, 75, 75, 75, 75])
        self.state_low = np.zeros(48)
        self.state_high = np.array([4500, 7750,  # Total inventory and backorders
                                    # Inventory per stockpoint
                                    500, 500, 500, 500, 500, 500, 500, 500, 500,
                                    # 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29
                                    500, 500, 500, 500, 500, 500, 500, 500, 500, 500,
                                    500, 500, 500, 500, 150, 150, 150, 150, 150,
                                    # 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
                                    150, 150, 150, 150, 75, 75, 75, 75, 75, 75,
                                    75, 75, 75, 75, 75, 75, 75, 75])  # In transit per stockpoint


# PPO Settings
# activation function of network
network_activation = 'tanh'
# size of network
network_size = (64, 64)
# initial values of bias in network
network_bias_init = 0.0
# method of weight initialization for network (uniform or normal)
network_weights_init = 'uniform'
# number of iterations between evaluation
ppo_evaluation_steps = 500
# number of consecutive evaluation iterations without improvement
ppo_evaluation_threshold = 250
# maximum number of iterations in learning run
ppo_iterations = 50000
# length of one episode in buffer
ppo_buffer_length = 256
# discount factor used in GAE calculations
ppo_gamma = 0.99
# lambda rate used in GAE calculations
ppo_lambda = 0.95
# indicator of using a cooldown period in the buffer (boolean)
cooldown_buffer = False
# clipping value used in policy loss calculations
ppo_epsilon = 0.2
# learning rate for policy network
pi_lr = 1e-4
# learning rate for value network
vf_lr = 1e-4
# after x iterations, save model weights and histograms to tensorboard
ppo_save_freq = 500
# nr of epochs (i.e. repetitions of the buffer) used in updating the model weights
ppo_epochs = 10
# batch size used to split the buffer for updating the model weights
ppo_batch_size = 64
# number of simulation runs to compute benchmark and as stopping criterion
ppo_simulation_runs = 100
# length of simulation to compute benchmark and as stopping criterion
ppo_simulation_length = case.horizon
# length of initial simulation that is discarded
ppo_warmup_period = case.warmup

policy_results_states = [[0, 12, 12, 12, 12]]

for k in range(10):
    print("Replication " + str(k))
    # Initialize environment
    env = InventoryEnv(case, case.action_low, case.action_high,
                       case.action_min, case.action_max, case.state_low, case.state_high,
                       'DRL', fix=True)
    run_name = "RN{}".format(k)

    # set random seed
    set_seeds(env, k)

    # call learning function
    ppo_learning(env, False, experiment_name, run_name,
                 network_activation, network_size, network_bias_init, network_weights_init,
                 ppo_evaluation_steps, ppo_evaluation_threshold,
                 ppo_iterations, ppo_buffer_length, ppo_gamma, ppo_lambda, cooldown_buffer,
                 ppo_epsilon, pi_lr, vf_lr, ppo_save_freq, ppo_epochs, ppo_batch_size,
                 ppo_simulation_runs, ppo_simulation_length, ppo_warmup_period, policy_results_states)
