Linköping Studies in Science and Technology Thesis No. 507

Reinforcement Learning Using Local Adaptive Models

Magnus Borga

LIU-TEK-LIC-1995:39

Department of Electrical Engineering, Linköpings universitet, SE-581 83 Linköping, Sweden

Reinforcement Learning Using Local Adaptive Models

© 1995 Magnus Borga

Department of Electrical Engineering Linköpings universitet SE-581 83 Linköping Sweden

ISBN 91-7871-590-3 ISSN 0280-7971

Abstract

In this thesis, the theory of reinforcement learning is described and its relation to learning in biological systems is discussed. Some basic issues in reinforcement learning, the credit assignment problem and perceptual aliasing, are considered. The methods of temporal difference are described. Three important design issues are discussed: information representation and system architecture, rules for improving the behaviour and rules for the reward mechanisms. The use of local adaptive models in reinforcement learning is suggested and exemplified by some experiments. This idea is behind all the work presented in this thesis.

A method for learning to predict the reward called the prediction matrix memory is presented. This structure is similar to the correlation matrix memory but differs in that it is not only able to generate responses to given stimuli but also to predict the rewards in reinforcement learning. The prediction matrix memory uses the channel representation, which is also described. A dynamic binary tree structure that uses the prediction matrix memories as local adaptive models is presented.

The theory of canonical correlation is described and its relation to the generalized eigenproblem is discussed. It is argued that the directions of canonical correlations can be used as linear models in the input and output spaces respectively in order to represent input and output signals that are maximally correlated. It is also argued that this is a better representation in a response generating system than, for example, principal component analysis since the energy of the signals has nothing to do with their importance for the response generation. An iterative method for finding the canonical correlations is presented. Finally, the possibility of using the canonical correlation for response generation in a reinforcement learning system is indicated.

To Maria

Acknowledgements

Although I have spent most of the time by myself this summer writing this thesis, there are a number of people who have made it possible for me to realize this work.

First of all, I would like to thank all my colleagues at the Computer Vision Laboratory for their professional help in the scientific work and, most of all, for being good friends.

Many thanks to my supervisor Dr. Hans Knutsson, whose ideas, inspiration and enthusiasm are behind most of the work presented in this thesis.

I also thank professor Gösta Granlund for running this research group and letting me be a part of it.

Among my other colleagues I would in particular like to thank Tomas Landelius and Jörgen Karlholm for spending a lot of time discussing ideas and problems with me, for reading and constructively criticizing the manuscript and for giving me a lot of good references. A special thanks to Tomas for providing me with interesting literature. Slainte!

I would also like to thank Dr. Mats Andersson for his kind help with a lot of technical details which are not described in this thesis and Dr. Carl-Johan Westelius for the beer supply that eased the pain of writing during the warmest days of the summer.

I should also express my gratitude to Niclas Wiberg who evoked my interest in neural networks.

I gratefully acknowledge the financial support from NUTEK, the Swedish National Board for Industrial and Technical Development.

Finally I would like to thank my wife, Maria, for her love and support, for spending all her summer vacation here in the city with me instead of being at the summer cottage and for proof-reading my manuscript and giving constructive critique of the language in my writing. Remaining errors, due to final changes, are to be blamed on me.

"We learn as we do and we do as well as we have learned."

- Vernon B. Brooks

Contents

1 INTRODUCTION
  1.1 Background
  1.2 The thesis
  1.3 Notation
  1.4 Learning
    1.4.1 Machine learning
    1.4.2 Neural networks

2 REINFORCEMENT LEARNING
  2.1 Introduction
  2.2 Problems in reinforcement learning
    2.2.1 Perceptual aliasing
    2.2.2 Credit assignment
  2.3 Learning in an evolutionary perspective
  2.4 Design issues in reinforcement learning
    2.4.1 Information representation and system architecture
    2.4.2 Rules for improving the behaviour
    2.4.3 Rules for the reward mechanisms
  2.5 Temporal difference and adaptive critic
    2.5.1 A classic example
  2.6 Local adaptive models
    2.6.1 Using adaptive critic in two-level learning
    2.6.2 The badminton player

3 PREDICTION MATRIX MEMORY
  3.1 The channel representation
  3.2 The order of a function
  3.3 The outer product decision space
  3.4 Learning the reward function
    3.4.1 The Learning Algorithm
    3.4.2 Relation to TD-methods
  3.5 A tree structure
    3.5.1 Generating responses
    3.5.2 Growing the tree
  3.6 Experiments
    3.6.1 The performance when using different numbers of channels
    3.6.2 The use of a tree structure
    3.6.3 The XOR-problem
  3.7 Conclusions

4 CANONICAL CORRELATION
  4.1 Introduction
  4.2 The generalized eigenproblem
    4.2.1 The Rayleigh quotient
    4.2.2 Finding successive solutions
    4.2.3 Statistical properties of projections
  4.3 Learning canonical correlation
    4.3.1 Learning the successive canonical correlations
  4.4 Generating responses
  4.5 Relation to a least mean square error method
  4.6 Experiments
    4.6.1 Unsupervised learning
    4.6.2 Reinforcement learning
    4.6.3 Combining canonical correlation and the prediction matrix memory
  4.7 Conclusions

5 SUMMARY

A PROOFS
  A.1 Proofs for chapter 3
    A.1.1 The constant norm of the channel set
    A.1.2 The constant norm of the channel derivatives
    A.1.3 Derivation of the update rule (eq. 3.17)
    A.1.4 The change of the father node (eq. 3.36)
  A.2 Proofs for chapter 4
    A.2.1 The derivative of the Rayleigh quotient (eq. 4.3)
    A.2.2 Orthogonality in the metrics A and B (eqs. 4.5 and 4.6)
    A.2.3 Linear independence
    A.2.4 The range of r (eq. 4.7)
    A.2.5 The second derivative of r (eq. 4.8)
    A.2.6 Positive eigenvalues of the Hessian (eq. 4.9)
    A.2.7 The successive eigenvalues (eq. 4.10)
    A.2.8 The derivative of the covariance (eq. 4.15)
    A.2.9 The derivatives of the correlation (eqs. 4.22 and 4.23)
    A.2.10 The equivalence of r and ρ
    A.2.11 Invariance with respect to linear transformations (page 85)
    A.2.12 Orthogonality in the metrics Cxx and Cyy (eq. 4.40)
    A.2.13 The equilibrium of the modified learning rule (eq. 4.50)
    A.2.14 The maximum of the product of two unimodal functions
    A.2.15 The most probable response (eq. 4.53)
    A.2.16 The derivatives of the squared error (eqs. 4.60 and 4.61)

B MINSKY'S AND PAPERT'S DEFINITION OF ORDER

1 INTRODUCTION

This chapter begins with a background to the interest in reinforcement learning. Then an outline of the thesis is given. After that, the notation used in the thesis is presented and defined. The chapter ends with a section describing learning in general, machine learning in particular and, as an example of the latter, learning in neural networks.

1.1 Background

The thesis deals with issues concerning learning in artificial systems and I will begin by addressing the following basic questions:

• What kind of artificial systems are we talking about?
• Why should artificial systems learn?
• What should artificial systems learn?
• How do artificial systems learn?

The artificial systems considered here are response generating systems, i.e. systems that interact with their environments by generating responses to given stimuli. These systems can be more or less autonomous, i.e. able to function in dynamic environments. A typical example of a response generating system is a robot. The word robot is akin to the Czech word "robota" which means work and, hence, a robot is a machine that does a piece of work that cannot be done by an animal or a human, or that we do not want them to do. The number of robots in the world is rapidly growing but their performance is still not very intelligent. Robots can often manage only one or a few very simple tasks. They certainly do it fast and with high precision, but they are incapable of handling dynamic environments, which requires modification of the behaviour.

The only reasonable remedy for this situation is to enable such systems to learn. One reason for this is that it would be extremely difficult, if not impossible, to program a robot to a performance that resembles intelligence. If they cannot be programmed to an intelligent behaviour, the only (known) alternative is learning. Another reason is that if it is hard to program a robot to handle complex situations in a known environment, it is even harder to make rules for how it should act in an environment unknown to the programmer! This reasoning also points to the answer to the question of what these systems should learn: they should gain knowledge that is difficult or impossible to program and they should learn to cope with changes in a dynamic environment.

From the above reasoning, we can conclude that the system cannot be dependent on a teacher who knows all the "correct" responses. If it was dependent on such a teacher and it was to learn to act in an unknown environment, it would have to be accompanied by a teacher that already knew how to handle that situation. And if there is a teacher who knows all the correct responses, the behaviour could be programmed. For this reason we are interested in systems that learn by "trial and error" (or, perhaps even more to the point, by "trial and success"). This learning method is called reinforcement learning.

The reinforcement learning algorithms presented in this thesis relate to Hebbian learning, where input and output signals are correlated. The algorithms are intended to be used as modules in larger systems. These modules can be viewed as local adaptive models, since they are intended to learn to handle only local parts of the problem to be solved. Some methods for how to combine these models into larger systems are also suggested.

1.2 The thesis

After a definition of the notation used in the thesis, this chapter ends with an introduction to learning systems. In chapter 2, reinforcement learning is described. Relations to learning in biological systems are discussed and two important problems in reinforcement learning, perceptual aliasing and credit assignment, are described. The important TD-methods are described and a well known example of TD-methods is presented. Three important design issues in reinforcement learning are described: information representation and system architecture, rules for improving the behaviour and rules for the reward mechanism. The chapter ends with a description of the idea of using local adaptive models in reinforcement learning and this principle is illustrated with a simple algorithm (presented at the International Conference on Artificial Neural Networks, ICANN, in Amsterdam 1993 [8]).

In chapter 3, the idea of learning the reward function is presented. First, a particular way of representing information called channel representation is described. Then, the concept of order of a function is defined. The outer product decision space is described and the prediction matrix memory is presented. It is shown to have similarities to the correlation matrix memory and can be used for predicting rewards as well as generating responses. A dynamic binary tree algorithm that uses the prediction matrix memory in its nodes is presented and the chapter ends with some experimental results.

Chapter 4 deals with the theory of canonical correlation. The chapter begins with a motivation of the choice of linear models that maximize the correlation between input and output signals. Then, the generalized eigenproblem is described and shown to be related to the problem of maximizing certain statistical measures of linear models. One of these measures is canonical correlation and a novel learning algorithm for finding canonical correlations is presented. The idea of using this algorithm in reinforcement learning is discussed and finally some experimental results are presented.

To increase the readability, only those proofs which are crucial for the understanding of the topics discussed appear in the text. The rest of the proofs appear in appendix A. In this way I hope to have made the text easy to follow for the reader who does not want to get too deep into mathematical details and, at the same time, to have given the proofs enough space to be followed without too much effort. This has also made it possible to include proofs that initiated readers may consider unnecessary without disrupting the text, and to keep the mathematical prerequisites at a minimum. When a statement is proved in the appendix, the proof is referred to in the text.

1.3 Notation

I will use a consistent notation throughout the thesis, assuming that it is familiar to the reader. This is my choice of notation and, since the scientific community is not consistent in its use of notation, many readers would have chosen a different one. To avoid misunderstandings, it is presented and explained here.

Lowercase letters in italics ($x$) are used for scalars, lowercase letters in boldface ($\mathbf{x}$) are used for vectors and uppercase letters in boldface ($\mathbf{X}$) are used for matrices. The transpose of a vector or a matrix is denoted with a $T$, e.g. $\mathbf{x}^T$. In terms of their components, vectors are defined by

$$\mathbf{v} = \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix},$$

where $n$ is the dimensionality of $\mathbf{v}$. The set of reals is denoted $\mathbb{R}$, and $\mathbb{R}^n$ denotes an $n$-dimensional space of reals. For the scalar product, or the inner product, between vectors,

$$\langle \mathbf{u} \mid \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v} = \sum_i u_i v_i,$$

the latter notation will be used to increase readability. The inner product between two matrices is defined by

$$\langle \mathbf{U} \mid \mathbf{V} \rangle = \sum_{i,j} u_{ij} v_{ij}.$$

Here the first notation is used. The norm $\|\mathbf{v}\|$ of a vector is defined by

$$\|\mathbf{v}\| = \sqrt{\mathbf{v}^T \mathbf{v}}$$

and a "hat" ($\hat{\ }$) indicates a vector with unit length, i.e.

$$\hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|}.$$

The outer product is defined by

$$\mathbf{u}\mathbf{v}^T = \begin{pmatrix} u_1 v_1 & \cdots & u_1 v_n \\ \vdots & \ddots & \vdots \\ u_n v_1 & \cdots & u_n v_n \end{pmatrix}.$$

$N(\mu, \sigma)$ is a normal distribution with mean $\mu$ and variance $\sigma$ and, consequently, a multivariate normal distribution with the mean vector $\mathbf{m}$ and the covariance matrix $\mathbf{C}$ is denoted $N(\mathbf{m}, \mathbf{C})$. $P(\cdot)$ denotes a probability distribution and $E\{\cdot\}$ denotes the expectation value.

In general, $\mathbf{x}$ denotes an input vector and $\mathbf{y}$ an output vector, i.e. the input to and output from the system. $\mathbf{w}$ and $\mathbf{W}$ denote weight vectors and weight matrices respectively, that is, the aggregates of parameters in the system that are altered during the learning process to finally carry the information or "knowledge" gained by the system.

1.4 Learning

What is learning?

According to Oxford Advanced Learner's Dictionary [37], learning is to

"gain knowledge or skill by study, experience or being taught."

In particular, we may consider the knowledge of how to act, i.e. how to respond to certain stimuli. This knowledge can be defined as behaviour which, according to the same dictionary, is a "way of acting or functioning." Narendra and Thathachar [61], two learning automata theorists, make the following definition of learning:

"Learning is defined as any relatively permanent change in behaviour resulting from past experience, and a learning system is characterized by its ability to improve its behaviour with time, in some sense towards an ultimate goal."

Learning has been studied since the end of the nineteenth century. In 1898 Thorndike [80] presented a theory in which an association between a stimulus and a response is established and this association is strengthened or weakened depending on the outcome of the response. This type of learning is called operant conditioning. The theory of classical conditioning presented by Pavlov in 1899 [66] is concerned with the case when a natural reflex to a certain stimulus becomes a response to a second stimulus that has preceded the original stimulus several times. In the 1930s Skinner [73] developed Thorndike's ideas but claimed, as opposed to Thorndike, that learning was more "trial and success" than "trial and error". These ideas belong to the psychological position called behaviourism. Since the 1950s rationalism has gained more interest. In this view, intentions and abstract reasoning play an important role in learning. An example of this is Bandura's theory from 1969, where observation is important for learning. In this thesis, however, I have a more behaviouristic view, since I am talking of learning in machines that can hardly have any intentions or be capable of abstract reasoning, at least not in an everyday sense. As will be seen, the learning principle called reinforcement learning discussed in this thesis has much in common with Thorndike's and Skinner's operant conditioning. Learning theories have been thoroughly described e.g. by Bower and Hilgard [10].

There are reasons to believe that "learning by doing" is the only way of learning to produce responses, as discussed by Wilson and Knutsson [92] or, as stated by Brooks [12]:

"These two processes of learning and doing are inevitably intertwined; we learn as we do and we do as well as we have learned."

An example of this is illustrated in an experiment by Held et al. [32, 57]. In this experiment, people wearing goggles that rotated or displaced their fields of view were either walking around for an hour or wheeled around the same path in a wheelchair for the same amount of time. The adaptation to the distortion was then tested. The subjects that had been walking had adapted while the other subjects had not. A similar situation occurs, for instance, when you are going somewhere by car. If you have driven to the same destination merely once before, you may well find it easier to find your way than if you have never driven there yourself, even if you have traveled the same path several times as a passenger.

1.4.1 Machine learning

We are used to seeing human beings and animals learn, but how can a machine learn? Well, according to the definitions above, a system (be it a machine, an animal or a human being) learns if it by experience gains knowledge or changes behaviour. So, obviously the answer depends on how the knowledge or behaviour is represented.

Let us consider behaviour as a rule for how to generate responses to certain stimuli. One way of representing this knowledge is to have a table with all stimuli and corresponding responses. Learning would then take place if the system, through experience, filled in or changed the responses in the table. Another way of representing knowledge is with a parameterized model where the output is a function of the input. Learning would then be to change the parameters of the model in order to improve the performance. This is the learning method used e.g. in neural networks. Another way of representing knowledge is to consider the input and output space together. Examples of this approach are an algorithm by Munro [60] that is described in section 2.4.2 and the Q-learning algorithm by Watkins [81]. I call this combined space of input and output the decision space since this is the space in which the combinations of input and output (i.e. stimuli and responses) that constitute decisions exist. This space could be treated as a table in which suitable decisions are marked. Learning would then be to make or change these markings. The knowledge could also be represented in the decision space as distributions describing suitable combinations of stimuli and responses [53]. Learning would then be to change the parameters of these distributions through experience in order to improve some measure of performance.
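The difference between a table-based and a parameterized representation of behaviour can be made concrete with a small sketch. The following Python fragment is only an illustration of the idea; the class names, the discretization of stimuli and the update rules are my own choices and do not come from the thesis.

```python
import numpy as np

class TablePolicy:
    """Behaviour stored explicitly: one quality estimate per discretized stimulus and response."""
    def __init__(self, n_states, n_actions):
        self.table = np.zeros((n_states, n_actions))   # entry = estimated quality of a decision

    def respond(self, state):
        return int(np.argmax(self.table[state]))       # look up the best response

    def learn(self, state, action, outcome, rate=0.1):
        # move the stored estimate towards the observed outcome
        self.table[state, action] += rate * (outcome - self.table[state, action])

class LinearPolicy:
    """Behaviour stored implicitly: the output is a parameterized function of the input."""
    def __init__(self, n_inputs, n_outputs):
        self.W = np.zeros((n_outputs, n_inputs))       # the parameters ("weights")

    def respond(self, x):
        return self.W @ x                              # response = model output

    def learn(self, x, error, rate=0.1):
        # adjust the parameters so that the model output improves
        self.W -= rate * np.outer(error, x)
```

In both cases learning means changing stored numbers through experience; the difference lies in whether the knowledge sits in a table of decisions or in the parameters of a model.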

Obviously a machine can learn through experience by changing some parameters in a model or data in a table. But what is the experience and what measure of performance is the system trying to improve? In other words, what is the system learning? The answers to these questions depend on what kind of learning we are talking about. Machine learning can be divided into three classes that differ in the external feedback given to the system during learning:

• Supervised learning
• Unsupervised learning
• Reinforcement learning

In supervised learning there is a "teacher" that shows the system the desired response for each stimulus. Here the experience is pairs of stimuli and desired responses and improving the performance means minimizing some error measure, e.g. the squared difference between the system's output and the desired output (as in back-propagation).

In unsupervised learning there is no external feedback. The experience consists of a set of signals and the measure of performance is often some statistical measure of the representation, e.g. the variance (as in Oja's rule), the covariance (as in Hebbian learning [30]) or the correlation (as in chapter 4). Unsupervised learning is perhaps not learning in the word's everyday sense, since it does not learn to produce responses in the form of useful actions. Rather, it learns a certain representation which may be useful in further processing. Some principles in unsupervised learning can, however, be useful in other learning methods, as will be seen in chapters 3 and 4.

In reinforcement learning there is a "teacher", but this teacher does not give a desired response, only a scalar reward or punishment (reinforcement) according to how good the decision was. In this case the experience is triplets of stimuli, responses and corresponding rewards. The performance to improve is simply the received reinforcement. Reinforcement learning will be discussed further in the next chapter.

1.4.2 Neural networks

Neural networks are perhaps the most popular and well known implementations of artificial learning systems. There is no unanimous definition of neural networks, but they are usually characterized by a large number of massively connected, relatively simple processing units. Learning capabilities are often understood even if they are not explicit. One could of course imagine a "hard wired" neural network incapable of learning. While the reinforcement learning systems in this thesis are not neural networks in the architectural sense, they are, like neural networks, parameterized models and the learning rules are similar to those in neural networks. For this reason, neural networks and two related learning methods are described here. This section may be skipped by readers already familiar with neural networks.

Figure 1.1: The basic neuron. The output y is a non-linear function of a weighted sum of the inputs x.

The processing units in a neural network are often called neurons (hence the name neural network) since they were originally designed as models of the nerve cells (neurons) in the brain. In figure 1.1 an artificial neuron is illustrated. This basic model of an artificial neuron was proposed in 1943 by McCulloch and Pitts [56], where the non-linear function

f was a Heaviside (unit step) function, i.e.

$$f(x) = \begin{cases} 0 & x < 0 \\ 1/2 & x = 0 \\ 1 & x > 0 \end{cases} \qquad (1.1)$$
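As a minimal illustration of this basic neuron, the sketch below applies the step function of equation (1.1) to a weighted sum of the inputs. The function names, the example weights and the inputs are my own choices for the illustration.

```python
import numpy as np

def heaviside(s):
    """The unit step function of equation (1.1), with f(0) = 1/2."""
    if s < 0:
        return 0.0
    if s > 0:
        return 1.0
    return 0.5

def neuron_output(x, w):
    """Output of the basic neuron: a non-linearity applied to a weighted sum of the inputs."""
    return heaviside(np.dot(w, x))

# Example: two inputs with equal positive weights act as a simple threshold unit.
print(neuron_output(np.array([1.0, 0.0]), np.array([0.7, 0.7])))  # prints 1.0
```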

An example of a neural network is the two-layer perceptron illustrated in figure 1.2, which consists of neurons like the one described above connected in a feed-forward manner. The neural network is a parameterized model and the parameters are often called weights. In the 1960s Rosenblatt [70] presented a supervised learning algorithm for a single-layer perceptron. In 1969, however, Minsky and Papert [58] showed that a single-layer perceptron failed to solve even some simple problems, such as the Boolean exclusive-or function. While it was known that a three-layer perceptron can represent any function, Minsky and Papert doubted that a learning method for a multi-layer perceptron would be possible to find. This almost extinguished the interest in neural networks for nearly two decades, until the 1980s when learning methods for multi-layer perceptrons were developed. The most well known method is back-propagation, presented in a Ph.D. thesis by Werbos in 1974 [82] and later presented in 1986 by Rumelhart, Hinton and Williams [71].

Figure 1.2: A two-layer perceptron with a two-dimensional input and a three-dimensional output.

The back-propagation algorithm is presented in the following section as an example of supervised learning in neural networks. After that, an example of unsupervised learning in neural networks known as feature mapping is presented. Reinforcement learning will be described in the next chapter.

Back-propagation

Back-propagation is a supervised learning algorithm for multi-layer feed-forward neural networks with differentiable nonlinearities, usually some kind of sigmoid function. I will give a brief description of this algorithm, partly because it is probably the most well-known learning algorithm in neural networks and partly because it is a good example of supervised learning. Back-propagation will also be referred to later in the thesis.

Consider a two-layer network like the one in figure 1.2, consisting of neurons as described in figure 1.1. The aim of the algorithm is to find a setting of the weights $\mathbf{w}$ such that the mean square error between the output vector $\mathbf{y}$ and the desired output $\mathbf{d}$ is minimized. Back-propagation can shortly be described as calculating the partial derivatives of the squared error with respect to the weights and updating the weights in the opposite direction of these derivatives. The errors are propagated back through the network. Let $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$ be the input vector, $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{im})^T$ the output vector from layer $i$ and $\mathbf{w}_{ij} = (w_{ij1}, w_{ij2}, \ldots, w_{ijk})^T$ the weight vector in unit $j$ in layer $i$.

The output from unit $i$ in the first layer is calculated as

$$y_{1i} = f\left(\mathbf{w}_{1i}^T \mathbf{x}\right) \qquad (1.2)$$

and from unit $j$ in the second layer as

$$y_{2j} = f\left(\mathbf{w}_{2j}^T \mathbf{y}_1\right). \qquad (1.3)$$

$f$ is a sigmoid function, e.g.

$$f(s) = \frac{1}{1 + e^{-s}}. \qquad (1.4)$$

Let the desired output vector be denoted $\mathbf{d}$. The squared error of the output is

$$e^2 = \|\mathbf{y}_2 - \mathbf{d}\|^2. \qquad (1.5)$$

The derivative of the squared error with respect to the weight vector in node $j$ in the second layer is then

$$\frac{\partial e^2}{\partial \mathbf{w}_{2j}} = 2\,(y_{2j} - d_j)\, f'\!\left(\mathbf{w}_{2j}^T \mathbf{y}_1\right) \mathbf{y}_1, \qquad (1.6)$$

where $f'$ is the derivative of the sigmoid function. To calculate the derivatives of the weights in the first layer we use the chain rule:

$$\frac{\partial e^2}{\partial \mathbf{w}_{1i}} = \sum_j \frac{\partial e^2}{\partial y_{2j}} \frac{\partial y_{2j}}{\partial \mathbf{w}_{1i}} = \sum_j 2\,(y_{2j} - d_j)\, f'\!\left(\mathbf{w}_{2j}^T \mathbf{y}_1\right) w_{2ji}\, f'\!\left(\mathbf{w}_{1i}^T \mathbf{x}\right) \mathbf{x}. \qquad (1.7)$$

Now, back-propagation simply tries to minimize the sum of the squared errors by changing the weights in the opposite direction of the gradient of the mean square error with respect to the weights, i.e. it makes a gradient search. It does this by, for each input-output pair, changing the weight vectors with an amount $\Delta\mathbf{w}$ proportional to the partial derivatives of $e^2$ with respect to the weights:

$$\Delta\mathbf{w}_{ij} = -\alpha \frac{\partial e^2}{\partial \mathbf{w}_{ij}}, \qquad (1.8)$$

where $\alpha$ is an "update factor" that controls the step size. If $\alpha$ is sufficiently small, it can be shown (see e.g. [29]) that this rule on the average is a gradient descent on the mean square error function.

Feature mapping

As an example of unsupervised learning, we will look at the "Self-Organizing Feature Map" (SOFM) developed by Kohonen [51]. The SOFM uses competitive learning, which means that only one output unit at a time can be active. This is the main difference between feature mapping and Hebbian learning [30]. The idea is to map an $m$-dimensional space into an $n$-dimensional map so that "similar" inputs activate adjacent output nodes. Hence, the SOFM is a topology preserving map. Often $m > n$, so that the SOFM performs dimensionality reduction.

Figure 1.3: A two-dimensional single-layer feature map with nine nodes and a three-dimensional input.

In figure 1.3 an example of a two-dimensional SOFM is illustrated. All inputs are connected to all units. The input vector $\mathbf{x}$ is connected to unit $(i, j)$ through the weight vector $\mathbf{w}_{(i,j)}$. The output is generated by activating the output node whose weight vector has the smallest Euclidean distance to the input vector. The learning rule is quite simple. For each input vector a winner node $(i^*, j^*)$ is selected and the weights are changed according to

$$\Delta\mathbf{w}_{(i,j)} = \alpha\, h\!\left((i,j), (i^*, j^*)\right) \left(\mathbf{x} - \mathbf{w}_{(i,j)}\right), \qquad (1.9)$$

where $\alpha$ is an update factor that controls the learning speed and $h\!\left((i,j), (i^*, j^*)\right)$ is a "neighbourhood function" which is 1 when $(i,j) = (i^*, j^*)$ and decreases as $\|(i,j) - (i^*, j^*)\|$ increases. This means that the winner node will be updated more towards the input than the other nodes, and the nodes close to the winner node will be updated more than the ones further away. During learning, the update factor and the size of the neighbourhood function decrease. The mapping converges to a shape such that regions in the input space that have a high signal probability density will be represented by a larger number of units than more sparse regions.

The SOFM is closely related to another algorithm by Kohonen called "Learning Vector Quantization" (LVQ) [52] which, however, is a supervised learning algorithm. Another well-known example of unsupervised learning is Oja's rule [63] which performs principal component analysis. This algorithm is an example of Hebbian learning which will be discussed in chapter 4.

2 REINFORCEMENT LEARNING

In this chapter, the basics of reinforcement learning are presented. Differences and relations between reinforcement learning and other learning paradigms are described and the relation to learning in natural systems is discussed. In section 2.2, two important problems in reinforcement learning, perceptual aliasing and credit assignment, are described. The relation between evolution and reinforcement learning is discussed in section 2.3. Three basic issues in the design of reinforcement learning systems are described in section 2.4. These issues are information representation and system architecture, rules for improving the behaviour and rules for the reward mechanisms. In section 2.5, an established class of methods called temporal difference methods that handles the temporal credit assignment problem is described. A well known reinforcement learning system that solves the pole balancing problem is presented. Finally, in section 2.6 the idea of local adaptive models is presented and illustrated in a simple experiment.

2.1 Introduction

Reinforcement learning can be seen as a form of supervised learning, but there is one important difference: while a supervised learning system is supplied with the desired responses during its training, the reinforcement learning system is only supplied with a quality measure of the system's overall performance (figure 2.1). In other words, the feedback to a reinforcement learning system is evaluative rather than instructive as in supervised learning. This quality measure may in many cases be easier to obtain than a set of "correct" responses. Consider for example the situation when a child learns to bicycle. It is not possible for the parents to explain to the child how it should behave, but it is quite easy to observe the trials and conclude how well the child manages. There is also a clear (though negative) quality measure when the child fails. This is perhaps the main reason for the great interest in reinforcement learning in autonomous systems and robotics. The teacher does not have to know how the system should solve a task but only be able to decide if (and perhaps how well) it solves it. Hence, a reinforcement learning system requires feedback to be able to learn, but this is a very simple form of feedback compared to what is required for a supervised learning system. In some cases, the task of the teacher may even become so simple that it can be built into the system. For example, consider a system that is only to learn to avoid heat. The "teacher" in this case may consist only of a set of heat sensors. In such a case, the reinforcement learning system is more like an unsupervised learning system than a supervised one. For this reason, reinforcement learning is often referred to as a class of learning systems that lie in between supervised and unsupervised learning systems.

Figure 2.1: The difference between reinforcement learning and supervised learning. Both systems can be seen as a mapping from a vector x to a vector y. The difference is the feedback. In the reinforcement learning system, the feedback is a scalar r that is a quality measure, while in the supervised system the feedback is a vector ỹ that is the desired output.

A reinforcement, or reinforcing stimulus, is defined as a stimulus that strengthens the behaviour that produced it. As an example, consider the procedure of training an animal. In general, there is no point in trying to explain to the animal how it should behave. The only way is simply to reward the animal when it does the right thing. If an animal is given a piece of food each time it presses a button when a light is flashed, it will (in most cases) learn to press the button when the light signal appears. We say that this behaviour has been reinforced. We use the food as a reward to train the animal. Of course, it is not the food itself that reinforces the behaviour, but a biochemical process of some sort in the animal, probably originating from what is often referred to as the pleasure centre in the brain¹, that makes the animal feel comfortable or satisfied. In other words, there is some mechanism in the animal that generates a reinforcement signal when the animal gets food (at least if it is hungry) and when it has other experiences that in some sense are good for the animal (i.e. increase the probability for the reproduction of its genes). Experiments were made in the 1950s by Olds [64] where the pleasure centre was artificially stimulated instead of giving a reward. In this case, the animal was able to learn behaviours that were even destructive to the animal itself.

¹ The pleasure centre is believed to be centred in a part of the frontal brain called septi.

In the example above, the reward (the piece of food) was used merely to trigger the reinforcement signal. In the following discussion of artificial systems, however, the two terms have the same meaning. In other words, we will use only one kind of reward, namely the reinforcement signal itself, which, in the case of an artificial system, we can allow ourselves to have direct access to without any ethical considerations. In the case of a large system, one would of course want the system to be able to solve different routine tasks besides the main task (or tasks). For instance, suppose we want the system to learn to charge its batteries by itself. This behaviour should then be reinforced in some way. Whether we put a box into the system that reinforces the battery-charging behaviour or we let the charging device or a teacher deliver the reinforcement signal is a technical question more than a philosophical one. If, however, the box is built into the system we could train it, for instance, to press a button whenever it sees a light signal by rewarding that behaviour with some electricity.

Since reinforcement learning is associated with learning among animals (including human beings), many people find it hard to see how a machine could learn by this "trial and error" method. A simple example created by Donald Michie in the 1960s, where a pile of match-boxes "learns" to play noughts and crosses, illustrates that even a simple "machine" can learn by trial and error. The "machine" is called MENACE (Match-box Educable Noughts And Crosses Engine) and consists of 288 match-boxes, one for each possible state of the game. Each box is filled with a random set of coloured beans. The colours represent different moves. Each move is determined by the colour of a randomly selected bean from the box representing the current state of the game. If the system wins the game, new beans with the same colours as those selected during the game are added to the respective boxes. If the system loses, the beans that were selected are removed. In this way, after each game the possibility of making good moves increases and the risk of making bad moves decreases. Ultimately, each box will only contain beans representing moves that have led to success.
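The mechanism of MENACE is simple enough to sketch in a few lines. The reinforcement step below, adding a bean of the chosen colour after a win and removing it after a loss, follows the description above; the state encoding and the handling of empty boxes are simplifications of my own.

```python
import random
from collections import defaultdict

# One "match-box" per board state: a list of beans, where each bean encodes a legal move.
boxes = defaultdict(list)

def choose_move(state, legal_moves):
    """Draw a random bean from the box for this state (refill the box if it is empty)."""
    if not boxes[state]:
        boxes[state] = list(legal_moves) * 3       # a fresh set of beans of each colour
    return random.choice(boxes[state])

def reinforce(game_trace, won):
    """After a game, add or remove beans for every decision made during the game."""
    for state, move in game_trace:
        if won:
            boxes[state].append(move)              # reward: add a bean of that colour
        elif move in boxes[state]:
            boxes[state].remove(move)              # punishment: remove the selected bean
```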

There are some notable advantages with reinforcement learning compared to supervised learning, besides the obvious fact that reinforcement learning can be used in some situations where supervised learning is impossible (e.g. the child learning to bicycle and the animal learning examples above). The ability to learn by receiving rewards makes it possible for a reinforcement learning system to become more skillful than its teacher. It can even improve its behaviour by training itself, as in the backgammon program by Tesauro [79]. A reinforcement learning system is more general than a supervised learning system in the sense that supervised learning can emerge as a special case of reinforcement learning. A task for a supervised learning system can also always be reformulated to fit a reinforcement learning system simply by mapping the error vectors to scalars, e.g. as a function of the magnitudes of the error vectors. The possibility of combining reinforcement learning and supervised learning will be discussed in chapter 3.

On the other hand, new problems also arise. They will be discussed in the following section.

2.2 Problems in reinforcement learning

This section describes two problems in learning. The first problem, called perceptual aliasing, deals with the problem of consistency in the internal representation of external states. The second problem is called the credit assignment problem and deals with the problem of consistency in the feedback to the system during learning. These problems are present in learning in general but especially in reinforcement learning.

2.2.1 Perceptual aliasing

A learning system perceives the external world through a sensory subsystem and represents the set of external states $S_E$ with an internal state representation set $S_I$. This set can, however, rarely be identical to the real external world state set $S_E$. To assume a representation that completely describes the external world in terms of objects, their features and relationships is unrealistic even for relatively simple problem settings. Also, the internal state is inevitably limited by the sensor system, which leads to the fact that there is a many-to-many mapping between the internal and external states. That is, a state $s_e \in S_E$ in the external world can map into several internal states and, what is worse, an internal state $s_i \in S_I$ could represent multiple external world states. This phenomenon was defined by Whitehead and Ballard [86] as perceptual aliasing.

Figure 2.2 illustrates two cases of perceptual aliasing. One case is when two external states $s_e^1$ and $s_e^2$ map into the same internal state $s_i^1$. An example of this is when two different objects appear to the system as identical. This is illustrated in view 1 in figure 2.3. The other case is when one external state $s_e^3$ is represented by two internal states $s_i^2$ and $s_i^3$. This happens, for instance, in a system consisting of several local adaptive models if two or more models happen to represent the same solution to the same part of the problem.

Figure 2.2: Two cases of perceptual aliasing. Two external states $s_e^1$ and $s_e^2$ are mapped into the same internal state $s_i^1$, and one external state $s_e^3$ is mapped into two internal states $s_i^2$ and $s_i^3$.

Perceptual aliasing may cause the system to confound different external states that have the same internal state representation. This type of problem can cause a response generating system to make the wrong decisions. For example, let the internal state $s_i$ represent the external states $s_e^a$ and $s_e^b$ and let the system take an action $a$. The expected reward for the decision $(s_i, a)$, to take the action $a$ given the state $s_i$, is now estimated by averaging the rewards for that decision accumulated over time. If $s_e^a$ and $s_e^b$ occur approximately equally often and the actual accumulated reward for $(s_e^a, a)$ is greater than the accumulated reward for $(s_e^b, a)$, the expected reward will be underestimated for $(s_e^a, a)$ and overestimated for $(s_e^b, a)$, leading to a non-optimal decision policy.
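The misestimation caused by averaging over aliased states can be illustrated numerically. In the sketch below, the two external states are assumed to give rewards 1.0 and 0.0 for the same action; the reward values and the equal visiting frequency are invented for the illustration.

```python
import random

r_true = {"s_e_a": 1.0, "s_e_b": 0.0}   # actual reward for action a in each external state

estimate, n = 0.0, 0
for _ in range(1000):
    s_e = random.choice(["s_e_a", "s_e_b"])   # both external states map to the same internal state s_i
    n += 1
    estimate += (r_true[s_e] - estimate) / n  # running average of the reward for the decision (s_i, a)

# The estimate ends up near 0.5: too low for (s_e_a, a) and too high for (s_e_b, a).
print(round(estimate, 2))
```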

There are cases when this is no problem and the phenomenon is a feature instead. This happens if all decisions made by the system are consistent. The reward for the decision $(s_i, a)$ then equals the reward for all corresponding actual decisions $(s_e^k, a)$, where $k$ is an index for this set of decisions. If the mapping between the external and internal worlds is such that all decisions are consistent, then it is possible to collapse a large actual state space into a small one, where situations that are invariant to the task at hand are mapped onto one single situation in the representation space. In general, this will seldom be the case and perceptual aliasing will be a problem.

Whitehead [87] has presented a solution to the problem of perceptual aliasing for a restricted class of learning situations. The basic idea is to detect inconsistent decisions by monitoring the estimated reward error, since the error will oscillate for inconsistent decisions as discussed above. When an inconsistent decision is detected, the system is guided (e.g. by changing its direction of view) to another internal state uniquely representing the desired external state. In this way, all actions will produce consistent decisions, see figure 2.3. The guidance mechanisms are not learned by the system. This is noted by Whitehead who admits that a dilemma is left unresolved:

Figure 2.3: Avoiding perceptual aliasing by observing the environment from another direction.

"In order for the system to learn to solve a task, it must accurately represent the world with respect to the task. However, in order for the system to learn an accurate representation, it must know how to solve the task."

2.2.2 Credit assignment

In all complex control systems, there might exist some uncertainty in how to deliver credit (or blame) for the control actions taken. This uncertainty is called the credit assignment problem [59]. Consider, for example, a political system. Is it the trade politics or the financial politics that deserves credit for the increasing export? We may call this a structural credit assignment problem. Is it the current government or the previous one that deserves credit or blame for the economic situation? This is a temporal credit assignment problem. Is it the society or the individual criminals that are to be blamed for the violence? This is what we may call a hierarchical credit assignment problem. These three types of credit assignment problems are also encountered in the type of control systems considered here, i.e. learning systems.

The structural credit assignment problem occurs, for instance, in a neural network when deciding which weights to alter in order to achieve an improved performance. In supervised learning, this problem can be handled by using back-propagation [71] for instance. The problem becomes more complicated in reinforcement learning where only a scalar feedback is available. Some ways of dealing with this problem will be discussed in the following section. In section 2.6, a description is given of how this problem can be handled by the use of local adaptive models.

The temporal credit assignment problem occurs when a system acts in a dynamic environment and a sequence of actions is performed. The problem is to decide which of the actions taken deserves credit for the result. Obviously, it is not certain that it is the final action taken that deserves all the credit or blame. (For example, consider the situation when the losing team in a football game scores a goal during the last seconds of the game. It would not be clever to blame the person who scored that goal for the loss of the game.) The problem becomes especially complicated in reinforcement learning if the reward occurs infrequently. This type of problem is thoroughly investigated by Sutton in his PhD thesis [77].

Consider a hierarchical system like Jacobs' and Jordan's "Adaptive mixtures of local experts". That system consists of two levels. On the lower level, there are several subsystems that specialize on different parts of the input space. On the top level there is a "supervisor" that selects the proper subsystem for a certain input. If the system makes a bad decision, it can be difficult to decide if it was the top level that selected the wrong subsystem or if the top level made a correct choice but the subsystem that generated the response made a mistake. This problem can of course be regarded as a type of structural credit assignment problem, but to emphasize the difference, I call this a hierarchical credit assignment problem. Once the hierarchical credit assignment problem is solved and it is clear on what level the mistake was made, the structural credit assignment problem can be dealt with to alter the behaviour on that level.

2.3 Learning in an evolutionary perspective

In this section, a special case of reinforcement learning called genetic algorithms is described. The purpose of this section is not to give a detailed description of genetic algorithms, but to illustrate the fact that they are indeed reinforcement learning algorithms. From this fact and the obvious similarity between biological evolution and genetic algorithms (as indicated in the name), some interesting conclusions can be drawn concerning the question of learning at different time scales.

A genetic algorithm is a stochastic search method for solving optimization problems. The theory was founded by Holland in 1975 [36] and it is inspired by the theory of natural evolution. In natural evolution, the problem to be optimized is how to survive in a complex and dynamic environment. The knowledge of the problem is encoded as genes in the individuals' chromosomes. The individuals that are best adapted in a population have the highest probability of reproduction. In the reproduction, the genes of the new individuals (children) are a mixture, or crossover, of the parents' genes. In the reproduction there is also a random change in the chromosomes. This change is called mutation.

A genetic algorithm also works with coded structures of the parameter space. It uses a population of coded structures (individuals) and evaluates the performance of each individual. Each individual is reproduced with a probability that depends on that individual's performance. The genes of the new individuals are a mixture of the genes of two "parents" (crossover), and there is a random change in the coded structure (mutation). So far the similarities between genetic algorithms and biological evolution.

Genetic operators

Mentioned above are three genetic operators:

• reproduction
• crossover
• mutation

These are the three basic operators used in genetic algorithms. Let us try to acquire an intuitive understanding of them.

The purpose of reproduction is of course to keep the codings of good performance at the expense of less fitted individuals (the survival of the fittest).

Crossover is perhaps the most characteristic genetic operator. This operator takes two strings (chromosomes), one from each parent, and exchanges a part (or parts) of one string with the corresponding part (parts) of the other string. Crossover makes it possible to combine beneficial features from both the parents' chromosomes². Consider, for instance, two beneficial features A and B, both of which are unlikely mutations. In a large enough population, there are individuals with feature A and feature B respectively. Without crossover, it is not likely that an individual possesses both features, since that would require the offspring of the few individuals with feature A acquiring feature B. With crossover, however, it is possible for an individual with one parent possessing feature A and the other parent possessing feature B to inherit both the features A and B. If the most successful individuals reproduce most often, this is likely to happen.

Mutations act like a small amount of noise that is added to the algorithm. They can be compared to the temperature in simulated annealing. This "noise" can prevent the algorithm from getting trapped in local minima. Mutations occasionally introduce new beneficial genetic material.

² Here, each individual has only got one chromosome, which is not the only possible design (obviously).

These three genetic operators are fundamental in genetic algorithms, but they are not the only ones used. In many applications, other operators are added to this set. More information about genetic algorithms can be found e.g. in the books by Goldberg [20] and Davis [13].

Genetic algorithms and reinforcement learning

As we have seen, genetic algorithms (including the original algorithm, i.e. biological evolution) share some important features with reinforcement learning algorithms in general. The three basic genetic operators correspond to basic properties of reinforcement learning in the following way:

• Reproduction - iteration and reward: Each generation can be seen as an iteration of the algorithm and the reward corresponds to the selection of individuals (survival) or the probability for an individual to be reproduced.
• Crossover - replacement of local models: The genes can be seen as specialized local solutions, or models, and the crossover can be seen as replacement of such local models which are adapted in different situations.
• Mutation - addition of noise: Mutation is a stochastic deviation from a current model in search of a better one, just like the noise added to the output in a reinforcement learning algorithm.

Thus, genetic algorithms learn by the method of trial and error, just like other reinforcement learning algorithms. We might therefore argue that the same basic principles hold both for developing a system (or an individual) and for adapting this system to its environment. This is important since it makes the question of what should be built into the machine from the beginning ("hard wired") and what should be learned by the machine more of a practical engineering question than one of principle.
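A minimal genetic algorithm showing the three operators is sketched below. The bit-string encoding, the toy fitness function and the parameter values are arbitrary choices made for the illustration; only the reproduce/crossover/mutate structure corresponds to the description above.

```python
import random

def fitness(chromosome):
    """Toy fitness: the number of ones in the bit string."""
    return sum(chromosome)

def crossover(a, b):
    """Exchange the tails of two parent strings at a random point."""
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(c, rate=0.01):
    """Flip each bit with a small probability."""
    return [1 - bit if random.random() < rate else bit for bit in c]

population = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for generation in range(50):
    # reproduction: fitter individuals are selected as parents more often
    weights = [fitness(c) + 1e-6 for c in population]
    parents = random.choices(population, weights=weights, k=2 * len(population))
    population = [mutate(crossover(parents[2 * i], parents[2 * i + 1]))
                  for i in range(len(population))]

print(max(fitness(c) for c in population))
```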

Another interesting relation between evolution and learning on the individual level is discussed by Hinton and Nowlan [35]. They show that learning organisms evolve faster than non-learning equivalents. This is maybe not very surprising if evolution and learning are considered as merely different levels of a hierarchical learning system. In this case, the convergence of the slow high-level learning process (corresponding to evolution) depends on the convergence of the faster low-level learning process (corresponding to individual learning). Hence, we can look upon individuals as local adaptive models in a large reinforcement learning system.

2.4 Design issues in reinforcement learning

In reinforcement learning there are three important design issues:

• Information representation and system architecture
• Rules for improving the behaviour
• Rules for the reward mechanisms

These issues will be discussed in this section. The first two issues relate to the design of a reinforcement learning system, while the third issue concerns the teacher. The role of the teacher is, however, often automated and, hence, needs to be formalized.

2.4.1 Information representation and system architecture

Many system architecture designs have been proposed, but the mainstream one is the feed-forward neural network. It is, however, not necessary to use a neural network design, as we have seen in the MENACE example in the introduction to this chapter. In fact, the algorithms presented in this thesis have little in common with classical neural networks. There are, however, similarities in that they, just like neural networks, can be seen as signal processing systems, in contrast to AI or expert systems that use more symbolic representations. We may refer to the representation used in the signal processing systems as continuous representation, while the symbolic approach can be said to use a string representation. Examples of the latter are the "Lion Algorithm" of Whitehead and Ballard [86], the "Reinforcement Learning Classifier Systems" by Smith and Goldberg [74] and the MENACE example in the introduction to this chapter. Lee and Berenji [55] describe a rule based controller using reinforcement learning that solves the pole balancing problem described in section 2.5.1. The genetic algorithms that were described in section 2.3 are perhaps the most obvious examples of string representation in biological reinforcement learning systems.

The main difference between the two approaches is that the continuous representation has an implicit ªmetricº, i.e. there is a continuum of states and there exist meaningful interpolations between different states. One can talk about two states being more or less ªsimilarº. Interpolations are important in a learning system since it makes it possible for the system to make decisions in situations never experienced before. This is often referred to as generalization. In the string representation there is no implicit metric, i.e. there is no unambiguous way to tell which of two strings is more similar to a third string 2.4 DESIGN ISSUES IN REINFORCEMENT LEARNING 23 than the other. There are, however, also advantages with the string representation, e.g. the fact that today's computers are designed to work with string representation and have dif®culties in handling continuous information in an ef®cient way.

Ballard [4] suggests that it is unreasonable to suppose that peripheral motor and sen- sory activity is correlated in a meaningful way. Instead, it is likely that abstract sensory and motor representations are built and related to each other. Also, combined sensory and motor information must be represented and used in the generation of new motor ac- tivity. This implies a learning hierarchy and that learning occurs on different temporal scales [22, 23, 25, 26, 27]. Hierarchical learning system designs have been proposed by several other researchers, e.g. the ªHierarchical Mixture of Expertsº by Jordan and Ja- cobs [44] and different tree structures, see e.g. Breiman [11]. Also in this thesis hierar- chical learning systems are presented. Probably both approaches (signal and symbolic) described above are important, but on different levels in hierarchical learning systems. On a low level, the continuousrepresentation is probably to prefer since signal processing techniques have the potential of being faster than symbolic reasoning as they are easier to implement with analogue techniques and to parallelize. On a low level, interpolations

are also meaningful and desirable. In a simple control task for instance, consider two

s s r r

2 1 2

ªsimilarº stimuli 1 and which have the optimal responses and respectively. For

s s s r

1 2 3

a novel stimulus 3 located ªbetweenº and , the response could, with large prob-

r r 2 ability, be assumed to be in between 1 and . On a higher level, on the other hand, a more symbolic representation may be needed to facilitate abstract reasoning and plan- ning. Here the processing speed is not as crucial and interpolation may not even be desir- able. Consider, for instance, the task of passing a tree. On a low level the motor actions are continuous and meaningful to interpolate and they must be generated relatively fast. The higher level decision on which side of the tree to pass is, however, symbolic. Ob- viously it is not successful to interpolate the two possible alternatives of ªwalking to the rightº and ªwalking to the leftº. Also, there is more time to make this decision then to generate the motor actions needed for walking. It should be noted, however, that a neu- ral design does not automatically give good interpolation capabilities, but a continuous representation makes it possible to construct a system with capabilities to generalize by interpolation.

As we have seen, the choice of representation is closely related to the system architecture. The choice of representation can also be crucial for the ability to learn. In a paper by Denoeux and Lengellé [14], for example, the (n-1)-dimensional input space is transformed to an n-dimensional space in order to keep the norm of the input vectors constant and equal to one while preserving all the information. By using this new input representation, a scalar product can be used for calculating the similarity between the input vectors and a set of prototype vectors. Geman and Bienenstock [19] argue that

"the fundamental challenges in neural modeling are about representation rather than learning per se."

Furthermore, Hertz, Krogh and Palmer [33] present a simple but illustrative example to emphasize the importance of the representation of the input to the system. Two tasks are considered: the first one is to decide whether the input is an odd number; the second is to decide whether the input has an odd number of prime factors. If the input has a binary representation, the first task is extremely simple: the system just has to look at the least significant bit. The second task, however, is very difficult. If the base is changed to 3, for instance, the first task becomes much harder. And if the input is represented by its prime factors, the second task becomes easier. They also prove an obvious (and, as they say, silly) theorem:

"learning will always succeed, given the right preprocessor."
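As a concrete aside (not from the thesis), the two tasks can be written out in a few lines of Python. With a binary representation the first task is a single bit test, while the second requires factoring the input first:

    def is_odd(n: int) -> bool:
        # Trivial under a binary representation: inspect the least significant bit.
        return n & 1 == 1

    def has_odd_number_of_prime_factors(n: int) -> bool:
        # Hard under a binary representation: the number must be factored first.
        count, d = 0, 2
        while d * d <= n:
            while n % d == 0:
                count += 1
                n //= d
            d += 1
        if n > 1:
            count += 1
        return count % 2 == 1

If the input were instead represented by its list of prime factors, the second function would reduce to a parity check on the length of that list, which is the point of the example.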

A special type of representation called the channel representation will be presented in chapter 3. It is a biologically inspired representation which has several computational advantages.

2.4.2 Rules for improving the behaviour

In reinforcement learning, the feedback to the system contains no gradient information, i.e. the system does not know in what direction to search for a better solution. Because of this, most reinforcement learning systems are supplied with a stochastic behaviour. This can be done by adding noise to the output or by letting the output be generated from a probability distribution. In both cases the output, or response, consists of two parts: one deterministic and one stochastic. It is easy to see that both these parts are necessary in order for the system to be able to improve its behaviour. The deterministic part is the output that the system "thinks" is the optimal response. Without this, the system would make no sensible decisions at all. If this part were the only one, however, the system would never try anything other than the initial response to a certain stimulus, simply because it would never experience any better decision and hence have no reason to change its behaviour. So, the deterministic part of the output is necessary to generate good responses and the stochastic part is necessary to find better responses. The stochastic behaviour can also help the system to avoid getting trapped in local minima. There is a conflict between the need for exploration and the need for precision which is typical of reinforcement learning. This conflict does not normally occur in supervised learning.

At the beginning, when the system has poor knowledge of the problem to be solved, the deterministic part of the response is very unreliable and the stochastic part should preferably dominate in order to avoid a misleading bias in the search for correct responses. Later on, however, when the system has gained more knowledge, the deterministic part should have more influence so that the system makes at least reasonable guesses. Eventually, when the system has gained a lot of experience, the stochastic part should be very small in order not to disturb the generation of correct responses. A constant relation between the influence of the deterministic and stochastic parts is a compromise which will give poor search behaviour (i.e. slow convergence) at the beginning and bad precision after convergence. For this reason, many reinforcement learning systems have a noise level that decays with time. There is, however, a problem with this approach too. The decay rate of the noise level must be chosen to fit the problem; a difficult problem takes longer to solve, and if the noise level is decreased too fast, the system may never reach an optimal solution. Conversely, if the noise level decreases too slowly, the convergence will be slower than necessary. Another problem arises in a dynamic environment where the task may change after some time. If the noise level is then too low, the system will not be able to adapt to the new situation. For these reasons, an adaptive noise level is preferable.

The basic idea of an adaptive noise level is that when the system has poor knowledge of the problem, the noise level should be high, and when the system has reached a good solution, the noise level should be lower. This requires an internal quality measure that indicates the average performance of the system. It could, of course, be obtained by accumulating the rewards delivered to the system, for instance with the iterative update

    p(t+1) = α p(t) + (1 - α) r(t),                                    (2.1)

where p is the performance measure, r is the reward and α is a constant, 0 < α < 1. This gives an exponentially decaying average of the rewards given to the system, in which the most recent rewards are the most significant.
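A minimal sketch of how such a running average could drive an adaptive noise level; the linear scaling and the constants are my own illustrative choices, not taken from the thesis:

    def update_performance(p, r, alpha=0.9):
        # Exponentially decaying average of the reward, as in equation (2.1).
        return alpha * p + (1.0 - alpha) * r

    def noise_level(p, max_sigma=1.0, max_reward=1.0):
        # More exploration noise while the average performance is poor,
        # little noise once the average reward approaches its maximum.
        return max_sigma * max(0.0, 1.0 - p / max_reward)

The stochastic part of the response would then be drawn with a standard deviation given by noise_level(p), so that the search behaviour automatically widens again if the performance drops, e.g. when the environment changes.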

A better solution, involving a variance that depends on the predicted reinforcement, has been suggested by Gullapalli [28]. Such a prediction is calculated in some reinforcement learning algorithms for other reasons, which will be explained later, and when this prediction exists it can also be used to control the noise level. The advantage of such an approach is that the system may expect different rewards in different situations, for the simple reason that the system may have learned some situations better than others. The system should then have a very deterministic behaviour in situations where it predicts high rewards and a more exploratory behaviour in situations where it is more uncertain. In this way, the system will have a noise level that depends on the local skill rather than on the average performance.

Another way of controlling the noise level, or rather the standard deviation σ of a stochastic output unit, can be found in the REINFORCE algorithm by Williams [89]. Let μ be the mean of the output distribution and y the actual output. In the REINFORCE algorithm, if the output y gives a higher reward than the recent average, the variance will decrease if |y - μ| < σ and increase if |y - μ| > σ. If the reward is less than average, the opposite changes are made. This leads to a narrower search behaviour if good solutions are found close to the current solution or bad solutions are found outside the standard deviation, and a wider search behaviour if good solutions are found far away or bad solutions are found close to the mean.
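A minimal sketch of the corresponding updates for a Gaussian output unit, following the general REINFORCE form Δθ ∝ (r - baseline) ∂ln g/∂θ; the learning rate, the baseline handling and the lower bound on σ are my own simplifications, not taken from [89]:

    def reinforce_gaussian_update(mu, sigma, y, r, r_baseline, lr=0.01, min_sigma=1e-3):
        """One REINFORCE update for a Gaussian output unit with mean mu and
        standard deviation sigma, given an emitted output y and its reward r.

        The eligibilities are d(ln g)/d(mu) = (y - mu)/sigma^2 and
        d(ln g)/d(sigma) = ((y - mu)^2 - sigma^2)/sigma^3, so with a
        better-than-average reward sigma shrinks if |y - mu| < sigma and grows
        if |y - mu| > sigma, as described in the text.
        """
        delta = r - r_baseline
        e_mu = (y - mu) / sigma**2
        e_sigma = ((y - mu)**2 - sigma**2) / sigma**3
        mu_new = mu + lr * delta * e_mu
        sigma_new = max(sigma + lr * delta * e_sigma, min_sigma)
        return mu_new, sigma_new

    # Usage: sample y from N(mu, sigma), observe the reward r, keep a running
    # average of r as r_baseline, and then call reinforce_gaussian_update.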

Another strategy for a reinforcement learning system to improve its behaviour is to differentiate a model with respect to the reward in order to estimate the gradient of the reward in the model's parameter space. The model can be known a priori and built into the system, or it can be learned and refined during the training of the system. To know the gradient of the reward means to know in which direction in the parameter space to search for better performance. One way to use this strategy is described by Munro [60], where the model is a secondary network, called "Highnet", that is trained to predict the reward (see figure 2.4). This can be done with back-propagation, since the error is the difference between the reward and the prediction. The weights in "Highnet" are updated in the negative direction of the derivative of the prediction error, i.e.

    Δw_H = - ∂(r - p)² / ∂w_H,                                         (2.2)

where r is the reward, p is the prediction and w_H is the weight vector of "Highnet". The primary network that generates the actual output is called "Lownet". Now, since the effect of the weights in "Lownet" on the prediction can be calculated, back-propagation can be used to modify the weights in "Lownet", but here with the aim of maximizing the prediction from the secondary network, i.e.

    Δw_L = ∂p / ∂w_L,                                                  (2.3)

where w_L are the weight vectors in "Lownet". There is an obvious similarity between this system and the pole-balancing system presented in section 2.5.1 in that both systems are divided into two subsystems, one for learning the reward function and one for learning the behaviour. However, the subdivisions have different purposes in that they solve two different types of credit assignment problems. The subdivision in Munro's system solves a structural credit assignment problem, while in the case of pole balancing the subdivision solves a temporal credit assignment problem.
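A small linear sketch of this two-network idea; this is my own illustration, not Munro's implementation: both networks are reduced to linear models and the toy reward depends only on the response, so that the gradients of equations 2.2 and 2.3 can be written out by hand.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in = 3
    target = 0.7                      # toy task: the reward only depends on the output
    w_L = np.zeros(n_in + 1)          # "Lownet": y = w_L . [x, 1]
    w_H = np.zeros(n_in + 3)          # "Highnet": p = w_H . [x, y, y^2, 1]
    eta_H, eta_L, sigma = 0.5, 0.01, 0.2

    for step in range(30000):
        x = np.append(rng.normal(size=n_in), 1.0)       # sensory input plus bias
        y = w_L @ x + sigma * rng.normal()              # stochastic motor output
        z = np.concatenate([x[:-1], [y, y * y, 1.0]])   # Highnet sees input and output
        p = w_H @ z                                     # predicted reward
        r = -(y - target) ** 2                          # external reward
        # Eq. (2.2): descend the squared prediction error (normalized for stability).
        w_H += eta_H * (r - p) * z / (1.0 + z @ z)
        # Eq. (2.3): ascend the predicted reward; dp/dw_L = (dp/dy) * x.
        dp_dy = w_H[n_in] + 2.0 * w_H[n_in + 1] * y
        w_L += eta_L * dp_dy * x

    # The deterministic part of the output should have moved toward the target:
    print(w_L @ np.append(np.zeros(n_in), 1.0))         # roughly 0.7

The important point is the division of labour: the secondary model is fitted to the reward, and the primary model is changed by back-propagating through the secondary model, exactly as in equations 2.2 and 2.3.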

A similar approach to Munro's, but with a single unit, will be described in chapter 3. Other examples of similar strategies are described by Williams [89].


Figure 2.4: Illustration of the architecture of Munro's system. "Lownet" receives sensory input and generates motor output. "Highnet" receives a copy of the input and the output and generates a prediction of the reward. The weights in "Highnet" are updated to minimize the prediction error and the weights in "Lownet" are updated to maximize the prediction. In both cases back-propagation is used.

2.4.3 Rules for the reward mechanisms

Werbos [83] defines a reinforcement learning system as

"any system that through interaction with its environment improves its performance by receiving feedback in the form of a scalar reward (or penalty) that is commensurate with the appropriateness of the response."

The goal for a reinforcement learning system is simply to maximize this "reward", e.g. the accumulated value of the reinforcement signal r. In this way, r can be said to define the problem to be solved. For this reason the choice of reward function is very important. The reward, or reinforcement, must be capable of evaluating the overall performance of the system and be informative enough to allow learning.

In some cases it is obvious how to choose the reinforcement signal. For example, in the pole balancing problem described in section 2.5.1, the reinforcement signal is chosen as a negative value upon failure and as zero otherwise. Often, however, it is not that clear how to measure the performance, and the choice of reinforcement signal can affect the learning capabilities of the system.

The reinforcement signal should contain as much information as possible about the problem. The learning performance of a system can be improved considerably if a "pedagogical" reinforcement is used. One should not sit and wait for the system to attain a perfect performance, but use the reward to guide the system to a better performance. This is obvious in the case of training animals and human beings, but it also applies in the case of training artificial systems with reinforcement learning. Consider, for instance, an example where a system is to learn a simple function y = f(x). If a binary reward is used, i.e.

    r = 1   if |ỹ - y| < ε
    r = 0   otherwise,

where ỹ is the output from the system and y is the correct response, the system will receive no information at all³ as long as the responses are outside the interval defined by ε. If, on the other hand, the reward is chosen inversely proportional to the error, i.e.

    r = 1 / |ỹ - y|,

a relative improvement will yield the same relative increase in reward for all outputs. In practice, of course, this reward function could cause numerical problems, but it serves as an illustrative example of a good shape of a reward function. In general, a smooth and continuous function is preferable if possible. Also, the derivative should not be too small, at least not in regions where the system should not get stuck, i.e. in regions of bad performance. It should be noted, however, that sometimes there is no continuous reward function. In the case of pole balancing, for example, the pole either falls or it does not.

³ Well, almost none in any case; as the number of possible solutions which give output outside the interval approaches infinity (which it does in a continuous system), the information approaches zero.
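A small sketch of the two reward shapes; the tolerance ε and the cap used to avoid division by zero are my own choices for the illustration:

    def binary_reward(y_hat, y, eps=0.05):
        # Uninformative outside the tolerance band: zero almost everywhere early on.
        return 1.0 if abs(y_hat - y) < eps else 0.0

    def inverse_error_reward(y_hat, y, cap=1e3):
        # A relative improvement of the error gives the same relative increase in
        # reward anywhere; capped to avoid numerical problems near zero error.
        return cap if y_hat == y else min(1.0 / abs(y_hat - y), cap)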

A perhaps more interesting example of pedagogical reward can be found in a paper by Gullapalli [28], which presents a "reinforcement learning system for learning real-valued functions". This system was supplied with two input variables and one output variable. In one case, the system was trained on an XOR task. Each input was 0.1 or 0.9 and the output was any real number between 0 and 1. The optimal output values were 0.1 and 0.9 according to the logical XOR rule. At first, the reinforcement signal was calculated as

    r = 1 - |error|,

where error is the difference between the output and the optimal output. The system sometimes converged to wrong results, and in several training runs it did not converge at all. A new reinforcement signal was calculated as

    r = (r_0 + r_task) / 2,

where r_0 is the original reinforcement above. The term r_task was set to 0.5 if the latest output on similar input was less than the latest output on dissimilar input, and to -0.5 otherwise. With this reinforcement signal, the system began by trying to satisfy a weaker definition of the XOR task, according to which the output should be higher for dissimilar inputs than for similar inputs. The learning performance of the system improved in several ways with this new reinforcement signal.
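A sketch of how such a shaped reinforcement could be computed; the bookkeeping of the "latest outputs" is simplified to two function arguments here, which is my own shortcut:

    def shaped_xor_reward(output, optimal, last_similar_output, last_dissimilar_output):
        """Pedagogical reinforcement for the XOR task: the plain error term is
        averaged with a weaker, ordinal criterion that only asks the output on
        dissimilar inputs to exceed the output on similar inputs."""
        r_0 = 1.0 - abs(output - optimal)            # original reinforcement
        r_task = 0.5 if last_similar_output < last_dissimilar_output else -0.5
        return (r_0 + r_task) / 2.0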

Another reward strategy is to reward only improvements in behaviour, e.g. by calculating the reinforcement as

    r = p - r̄,

where p is a performance measure and r̄ is the mean reward acquired by the system. This gives a system that is never satisfied, since the reward vanishes in any solution with a stable reward. If the system has an adaptive search behaviour as described in the previous section, it will keep on searching for better and better solutions. The advantage of such a reward is that the system will not get stuck in a local optimum. The disadvantage is that it will obviously not stay in the global optimum if such an optimum exists. It will, however, always return to the global optimum, and at least some of this behaviour can be useful in a dynamic environment where a new optimum may appear after some time.

Even if the reinforcement in the last equation is a bit odd, it points out the fact that there may be negative reinforcement⁴, or punishment. The pole balancing system in section 2.5.1 is an example of the use of negative reinforcement, and in this case it is obvious that it is easier to deliver punishment upon failure than reward upon success, since the reinforcement is delivered after an unpredictably long sequence of actions; it would take an infinite amount of time to verify a success! In general, however, it is probably better to use positive reinforcement to guide a system towards a solution, for the simple reason that there is usually more information in the statement "this was a good solution" than in the opposite statement "this was not a good solution". On the other hand, if the purpose is to make the system avoid a particular solution (i.e. "Do anything but this!"), punishment would probably be more efficient.

⁴ Negative reinforcement may sound strange, but at least this term is better than the contradiction "negative reward". The term negative reinforcement is used by psychologists, but then with the meaning of an unpleasant stimulus that ceases.

2.5 Temporal difference and adaptive critic

When the learning system is a control system (e.g. a robot) in a dynamic environment, the system may have to carry out a sequence of actions to achieve a reward. In other words, the feedback to such a system may be infrequent and delayed, and the system faces the temporal credit assignment problem discussed in section 2.2.2. To simplify the discussion, I will assume that the environment or process to be controlled is a Markov process.

A Markov process consists of a set S of discrete states s_i where the conditional probability of a state transition only depends on a finite number of previous steps. See for example Derin and Kelly [15] for a systematic classification of different types of Markov models. If this condition holds, the definition of the states can be reformulated so that the state transition probabilities only depend on the current state, i.e.

    P(s_{k+1} | s_k, s_{k-1}, ..., s_1) = P(s_{k+1} | s_k).            (2.4)

Suppose one or several of these states are associated with a reward. Now, the goal for the learning system can be defined as maximizing the total accumulated reward for all future time steps.

One way to accomplish this task is, as in the MENACE example above, to store all states and actions until the final state (i.e. win or loss) is reached and to update the state transition probabilities afterwards. This method is referred to as "batch learning". An obvious disadvantage of this method is the need for storage, which becomes infeasible for large dimensionalities of the input and output vectors and for long sequences.


Figure 2.5: An example to illustrate the difference between batch learning and TD methods. A state that is likely to lead to a loss is classified as a bad state. A novel state that leads to the bad state but then happens to lead to a win is classified as a good state in batch learning. In a TD method it is recognized as a bad state since it most likely leads to a loss.

Another problem with batch learning is illustrated in figure 2.5. Consider a game where a certain position has resulted in a loss in 90% of the cases and a win in 10% of the cases. This position is classified as a bad position. Now, suppose that a player reaches a novel state (i.e. a state that has not been visited before) that inevitably leads to the bad state and finally happens to lead to a win. If the player waits until the end of the game and only looks at the result, he will label the novel state as a good state since it led to a win. This is, however, not true. The novel state is a bad state since it probably leads to a loss.

The methods of temporal differences (TD) described by Sutton [78] are designed to handle this problem. Suppose that for each state s_k there is a value p_k that is an estimate of the expected future result (e.g. the total accumulated reinforcement). In TD methods, the value of p_k depends on the value of p_{k+1} and not only on the final result. This makes it possible for TD methods to improve their predictions during a process without having to wait for the final result. Other similar algorithms are the adaptive heuristic critic by Sutton [77], heuristic dynamic programming by Werbos [83] and Q-learning by Watkins [81].
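A minimal tabular TD(0) sketch of this idea; the discount factor and the learning rate are illustrative choices, not values from the thesis:

    def td0_update(values, s, s_next, r, alpha=0.1, gamma=0.9):
        """Move the estimate for state s towards the reward received plus the
        discounted estimate of the following state, instead of waiting for the
        final outcome of the whole sequence."""
        td_error = r + gamma * values[s_next] - values[s]
        values[s] = values[s] + alpha * td_error
        return values

Because the update only involves two consecutive states, the novel state in figure 2.5 inherits the low value of the bad state it leads to, even if this particular game happened to end in a win.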

In control theory, an optimization algorithm called dynamic programming is a well-known method for maximizing the expected total accumulated reward. Given a utility function U (i.e. the reward) and a stochastic model of the environment, dynamic programming gives a secondary utility function J such that maximizing J in the short term is equivalent to maximizing U in the long term (see figure 2.6). The relationship between TD and dynamic programming has been discussed e.g. by Barto [5], Werbos [83] and Whitehead [88]. It should be noted, however, that maximizing the expected accumulated reward is not always the best criterion. This is discussed by Heger [31]. He notes that this criterion of choice of action

• is based upon long-run considerations, where the decision process is repeated a sufficiently large number of times. It is not necessarily a valid criterion in the short-run or one-shot case, especially when either the possible consequences or their probabilities have extreme values.

• assumes the subjective values of possible outcomes to be proportional to their objective values, which is not necessarily the case, especially when the values involved are large.

As an example of this, many people occasionally play the lottery in spite of the fact that the expected outcome is negative. Another example is that most people do not invest all their money in stocks, although this would give a larger expected payoff than putting some of it in the bank.

The first well-known TD method was used in a checkers-playing program in 1959 by Samuel [72]. In this system, the value of a state (board position) was updated according to the values of future states likely to appear. The prediction of future states requires a model of the environment (the game). This is, however, not the case in TD methods, where the feedback comes from actual future states; hence, no model is required.

Figure 2.6: Dynamic programming. Given a model (M) and a utility function (U), dynamic programming produces a secondary utility function (J).

The adaptive heuristic critic algorithm by Sutton [77] is an example of a TD method in which a secondary or internal reinforcement signal is used to avoid the temporal credit assignment problem. In this algorithm, a prediction of future reinforcement is calculated

at each time-step. The prediction made at time τ of the reinforcement at time t is

    p(τ, t) = v(τ)ᵀ x(t),                                              (2.5)

where v is a vector of weights called the reinforcement association vector and x is the input vector. If no external reinforcement signal is present at the time (r(t) = 0), the internal reinforcement signal is essentially the difference between the prediction, made at the previous step, of the reinforcement at the current time-step and the newly calculated prediction. If an external reinforcement signal does exist, it is added to the internal reinforcement. The internal reinforcement at time t+1 is calculated as

    r̂(t+1) = r(t+1) + γ p(t, t+1) - p(t, t),                           (2.6)

where 0 < γ ≤ 1 is a factor that controls how much the prediction should decrease for each step away from reinforcement. With this rule, the system gives itself credit for moving to a state with higher predicted reinforcement and a negative internal reinforcement for moving to a state with less predicted reinforcement. The system must learn to predict the reinforcement, and this is done by changing the reinforcement association vector. The update rule suggested by Sutton is

    v(t+1) = v(t) + α r̂(t+1) x̄(t),    v(0) = 0,                        (2.7)

where α is a positive constant and x̄(t) is a weighted average of x with weights decreasing backwards in time.
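A compact sketch of these updates; the trace-decay constant and the numerical values are illustrative, and only the structure of equations 2.5-2.7 is taken from the text:

    import numpy as np

    class AdaptiveCritic:
        """Adaptive heuristic critic: linear reinforcement prediction p = v.x and
        an internal reinforcement built from the temporal difference of predictions."""
        def __init__(self, n, alpha=0.1, gamma=0.95, lam=0.8):
            self.v = np.zeros(n)        # reinforcement association vector, v(0) = 0
            self.x_bar = np.zeros(n)    # weighted average of past inputs
            self.alpha, self.gamma, self.lam = alpha, gamma, lam

        def step(self, x_prev, x_new, r_new):
            p_old = self.v @ x_prev                     # p(t, t), eq. (2.5)
            p_new = self.v @ x_new                      # p(t, t+1)
            r_hat = r_new + self.gamma * p_new - p_old  # eq. (2.6)
            self.x_bar = self.lam * self.x_bar + (1 - self.lam) * x_prev
            self.v += self.alpha * r_hat * self.x_bar   # eq. (2.7)
            return r_hat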

Sutton [78] has proved a convergence theorem for one TD method⁵ that states that the prediction for each state asymptotically converges to the maximum-likelihood prediction of the final outcome for states generated in a Markov process. Most reinforcement learning methods can use TD methods, but TD methods can also be used in supervised learning systems. We will end this section with an example of an algorithm which uses the adaptive critic.

⁵ In this TD method, called TD(0), the prediction p_k only depends on the following prediction p_{k+1} and not on later predictions. Other TD methods can take later predictions into account with a function that decreases exponentially with time.

Figure 2.7: The cart-pole system used in the adaptive critic method by Barto, Sutton and Anderson.

2.5.1 A classic example

One of the most well-known examples of reinforcement learning in general, and of the adaptive critic in particular, is a system by Barto, Sutton and Anderson that solves a control problem called pole balancing [6]. The system to be controlled consists of a cart on a track with an inverted pendulum (a pole) attached to it, as illustrated in figure 2.7. The cart can move between two fixed endpoints of the track, and the pole can only move in the vertical plane of the cart and the track. The only way to control the movement of the cart is to apply a unit force F to it. The force must be applied in every time-step, so the controller can only choose the direction of the force (i.e. right or left). This means that the output a from the reinforcement learning system is a scalar with only two possible values, F or -F. The state vector s that describes the cart-pole system consists of four variables:

    x    the cart position,
    θ    the angle between the pole and the vertical,
    ẋ    the cart velocity, and
    θ̇    the angular velocity.

The reinforcement signal r is given to the learning system as a failure signal (i.e. negative reinforcement) only when the pole falls or the cart hits the end of the track; otherwise r = 0. The task for the system is to avoid negative reinforcement by choosing the "best" action a for each state s. The four-dimensional state space is divided into 162 disjoint states by quantizing the state variables. Each of these states can be seen as a memory position that holds the "best" action for that state. The state vector s is mapped by a decoder into an internal state vector x in which one component is 1 and the rest are 0. This representation of the input enables a very simple processing structure to perform a rather complicated function. It is a special case of the channel representation described in section 3.1.

This system consists of two parts: the "associative search element" (ASE) and the "adaptive critic element" (ACE), see figure 2.8. The ASE implements the mapping from the internal state vector x to the action a,

    a = f(xᵀw + ν),                                                    (2.8)

where f is the signum function and ν is a zero-mean Gaussian distributed noise. This mapping is changed by altering the weights w_i in such a way that the internal reinforcement signal r̂ is maximized. The ACE produces the internal reinforcement signal r̂ as a mapping x ↦ r̂. This mapping is chosen so that the accumulated value of the reinforcement signal is maximized. The ACE will develop a mapping that gives a positive value of r̂ when the state changes from a "dangerous" state to one that is "safer", and a negative value in the opposite case. In this way the system can maximize the reinforcement r in the long term by maximizing the internal reinforcement r̂ for each decision. For more details, see [6].

2.6 Local adaptive models

In a realistic problem, for an autonomous robot for example, the dimensionality of the input-output space is very large. Hence, a straightforward signal processing approach with a global model will not work, since the number of parameters in such a model would be far too large. The search in such a parameter space would be infeasible, since all combinations of parameters would have to be considered. A change of the parameters in one situation would change the model in all other situations. A symbolic AI approach would not work either, since the number of discrete states would become far too large. A solution to the problem is to divide the input and output spaces, or the combined input-output space (i.e. the decision space), into smaller regions in which simpler models can be used. This is the basic idea behind local adaptive models. That this approach allows the models to be simpler can be seen from the fact that any piecewise continuous function can be arbitrarily well approximated if a sufficiently large number of linear models is used. The total number of parameters in the set of local models may be equal to that in a large global model, but each local set of parameters can be changed independently of the parameters in other models, and the change only affects situations that are handled by that model.

Figure 2.8: The learning system in the pole-balancing problem. The associative search element (ASE) learns a mapping from the encoded input x to the action a. The adaptive critic element (ACE) learns to produce a continuous internal reinforcement signal r̂ as a function of x such that maximizing r̂ in each time-step gives a maximum external accumulated reinforcement r.

The use of local adaptive models implies a hierarchical structure with at least two levels. The models are on the first level, and on the second level there must be a mechanism that makes the choice of which model to use [53]. The parameterized models on the first level facilitate a continuous representation. On the second level, the transition between two states is of a more discontinuous nature (it does not have to be strictly discontinuous). This can be used to handle discontinuities in the input-output mapping and makes it possible to use a more symbolic (e.g. string) representation on the higher level.

Now, the problem of learning the total input-output transition function can be divided into several learning problems on different levels. On the first level, there are a number of continuous functions to be learned, one for each model. On the second level, the learning task is how to choose among the set of models.

The subdivision of the decision space also deals with the structural credit assignment problem described in the introduction to this chapter. Obviously the model that generates a response deserves credit or blame for that response, at least before any other model. On the other hand, a hierarchical credit assignment problem arises: it can, upon failure, be difficult to decide whether the model used on the low level is inaccurate or whether the wrong choice of model was made at the higher level.

Another advantage of a modular structure is that learning can be facilitated by training the system on simple sub-tasks. Each sub-task can be relatively easy to learn, and the system can later learn more complex behaviour by combining the "primitives" it has already learned. This "pedagogical" learning method is probably necessary in reinforcement learning if a very complex task is to be learned.

One example of local adaptive models is the "Adaptive Mixture of Local Experts" by Jacobs and Jordan [41, 43, 44]. Their system is a hierarchy of competing experts and gating networks, where the "experts" try to learn the input-output mapping and the gating network combines the output from the experts according to their performance. The system is a supervised learning system. An example of local adaptive models in unsupervised learning is the hierarchical arrangement of Kohonen networks described by Ritter, Martinetz and Schulten [69]. In this system, each node of a three-dimensional SOFM (see section 1.4, page 11) representing the spatial positions of a cylinder consists of a local two-dimensional SOFM representing different orientations of the cylinder. One of the few examples of this strategy in reinforcement learning is the "task decomposition" by Whitehead, Karlsson and Tenenberg [85], where different modules are used to achieve different "goals".

Other examples of local adaptive models are the different tree structures used in learning, see e.g. Quinlan's induction of decision trees [68], a review by Omohundro [65] and a report by Isaksson, Ljung and Strömberg on recursive construction of trees [40]. A dynamic tree structure for behaviour representation was presented by Landelius [53]. Another dynamic tree structure for reinforcement learning is presented in section 3.5.

2.6.1 Using adaptive critic in two-level learning

The adaptive critic methods described in section 2.5 could be used in a two-level learning hierarchy. In these methods, predictions of future reinforcement are made, and these predictions can be used to select a model.

If, for each model, a prediction of future reinforcement is calculated, the model with the highest prediction is selected to generate the output. The adjustment of the elements in the reinforcement association vectors v^i and the action association vectors w^i is made only for the chosen model and, hence, the internal reinforcement signal r̂ depends on which model was chosen. In this way, the structural credit assignment problem is reduced considerably.

Such a system will contain a number of reinforcement prediction functions. The global reinforcement prediction function x ↦ p is the maximum of the local reinforcement prediction functions. The reason for this is that, for any state, the algorithm chooses the model with the highest predicted reward and, hence, the model with the highest prediction will also have the most reliable prediction. In figure 2.9, an example of the reinforcement prediction functions for three models and a one-dimensional input is shown.
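A sketch of this selection rule with linear local predictions; the matrix layout is my own convention:

    import numpy as np

    def select_model(x, V):
        """Choose the local model with the highest predicted reinforcement.
        V is a matrix whose i-th row holds the reinforcement association vector
        of model i, so the prediction of model i is V[i] @ x."""
        predictions = V @ x
        c = int(np.argmax(predictions))
        return c, predictions[c]        # the global prediction is the maximum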


Figure 2.9: An example of a system with three local models with linear reinforcement prediction functions (thin lines). The global reinforcement prediction function (thick line) is the maximum of the local functions.

2.6.2 The badminton player

This chapter ends with a simple example of reinforcement learning using local adaptive models. This algorithm was presented at ICANN'93 in Amsterdam [8].

The experiment is made up of a system which plays "badminton" with itself. For simplicity, the problem is one-dimensional and possible to solve with only two models. The position of the shuttlecock is represented by a variable x. The system is allowed to choose between two models, and both models contain actions that can change the value of x by adding to x the output value y, positive or negative, and a small noise. The two models must alternate, i.e. the shuttlecock must pass the net between each action. It is, however, not defined a priori which model shall act on which side.

The reinforcement signal to the system is zero, except upon failure when r = -1. Failure is the case when x does not change sign (i.e. the shuttlecock does not pass the net), or when |x| > 0.5 (i.e. the shuttlecock ends up outside the plane).

The algorithm

Both the predictions p of future reinforcement and the actions y are linear functions of the input variable x. There is one pair of reinforcement association vectors v^i and one pair of action association vectors w^i, i ∈ {1, 2}. For each model i, the predicted reinforcement is calculated as

    p^i = N(μ_p^i, σ_p),                                               (2.9)

where

    μ_p^i = v_1^i x + v_2^i,                                           (2.10)

and the output is calculated as

    y^i = N(μ_y^i, σ_y),                                               (2.11)

where

    μ_y^i = w_1^i x + w_2^i.                                           (2.12)

The system chooses the model c such that

    p^c = max{p^i},   i ∈ {1, 2},                                      (2.13)

and generates the corresponding action y^c.

The internal reinforcement signal at time t+1 is calculated as

    r̂[t+1] = r[t+1] + γ μ_p^max[t, t+1] - p^c[t, t].                   (2.14)

This is in principle the same equation as equation 2.6, except that there are now two predictions at each time and that there is a difference between the calculated prediction μ_p and the actual prediction p. μ_p^max[t, t+1] is the maximum predicted reinforcement, calculated using the reinforcement association vectors from time t and the input from time t+1. If r = -1, then μ_p^max[t, t+1] is set to zero. p^c[t, t] is the prediction that caused the choice of the selected model.

Learning is accomplished by changing the weights in the reinforcement association vectors and the action association vectors. Only the vectors associated with the chosen model are altered. The association vectors are updated according to the following rule:

    w^c[t+1] = w^c[t] + α r̂ (y^c - μ_y^c) x̄                            (2.15)

and

    v^c[t+1] = v^c[t] + β r̂ x̄,                                         (2.16)

where c denotes the model choice, α and β are positive learning rate constants and x̄ = (x, 1)ᵀ.

Variance parameters

As mentioned in section 2.4, stochastic search methods are often used in reinforcement learning. Here, this is accomplished by generating the predictions and actions at random from a normal distribution, as shown in equations 2.9 and 2.11. The amount of search behaviour is controlled by the parameters σ_p and σ_y.

In this algorithm there are two predictions of future reinforcement, one for each model, so the prediction associated with the chosen model should be used. In the calculation of σ_p, however, the choice of model is not yet made and hence the largest prediction is used, since the model associated with this prediction will probably be chosen.

I have, after some experiments, chosen the following equations for the variance parameters:

    σ_p = max{0, -0.1 · max_i{μ_p^i}}                                  (2.17)

and

    σ_y = max{0, -0.1 · p^c}.                                          (2.18)

The first "max" in the two equations is there to make sure that the variances do not become negative. The negative signs are there because the (relevant) predictions are negative. In this way, the higher the prediction of reinforcement, the more precision in the output.
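A sketch of the complete update loop, assembled from equations 2.9-2.18. The learning constants, the serve after a failure and the size of the added position noise are my own assumptions, so this is an illustration of the structure rather than a reproduction of the experiment:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta, gamma = 0.2, 0.1, 0.9        # illustrative constants
    v = np.zeros((2, 2))                      # reinforcement association vectors v^i
    w = np.zeros((2, 2))                      # action association vectors w^i

    x = 0.1                                   # shuttlecock position (arbitrary serve)
    for t in range(50000):
        x_bar = np.array([x, 1.0])
        mu_p = v @ x_bar                              # eq. (2.10), both models
        sigma_p = max(0.0, -0.1 * mu_p.max())         # eq. (2.17)
        p = mu_p + sigma_p * rng.normal(size=2)       # eq. (2.9)
        c = int(np.argmax(p))                         # eq. (2.13): choose a model
        mu_y = w[c] @ x_bar                           # eq. (2.12)
        sigma_y = max(0.0, -0.1 * p[c])               # eq. (2.18)
        y = mu_y + sigma_y * rng.normal()             # eq. (2.11)

        x_new = x + y + 0.01 * rng.normal()           # the action moves the shuttlecock
        fail = (np.sign(x_new) == np.sign(x)) or (abs(x_new) > 0.5)
        r = -1.0 if fail else 0.0

        # Eq. (2.14): the "max" prediction uses the old weights and the new input,
        # and is set to zero upon failure.
        mu_p_max_new = 0.0 if fail else (v @ np.array([x_new, 1.0])).max()
        r_hat = r + gamma * mu_p_max_new - p[c]

        # Eqs. (2.15) and (2.16): only the chosen model is updated.
        w[c] += alpha * r_hat * (y - mu_y) * x_bar
        v[c] += beta * r_hat * x_bar

        x = 0.1 * rng.normal() if fail else x_new     # new serve after a failure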

Results

The learning behaviour is illustrated in figure 2.10. The algorithm has not been optimized with respect to convergence time. The convergence speed depends e.g. on the setting of the learning rate constants α and β and on the modulation of the variance parameters σ_p and σ_y. These parameters have only been tuned to values that work reasonably well.

One problem that can occur with this algorithm and similar algorithms is when both models prefer the same part of the input space. This means that the two reinforcement prediction functions predict the same reinforcement for the same inputs and, as a result, both models generate the same actions. This problem can of course be solved if the supervisor that generates the external reinforcement signal knows approximately where the breakpoint should be and which model should act on which side. The supervisor could then "punish" the system for selecting the wrong model by giving negative reinforcement. In general, however, the supervisor does not know how to divide the problem, and other solutions must be found. We will return to this problem in section 3.5.


Figure 2.10: An illustration of the behaviour of the system. The top graph displays the accumulated reinforcement over the last 100 steps and the bottom graph displays the breakpoint between the two strategies.

3 PREDICTION MATRIX MEMORY

As we have seen in the previous chapter, the reward can be regarded as a function that maps a stimulus-response pair (or a sequence of such pairs) to a scalar measure of performance. We have also seen that a reinforcement learning system must, at least locally, estimate some properties of this function, e.g. its values for alternative solutions or its derivative. If a suitable representation of the stimulus and the response is chosen, a class of reward functions can be approximated with a linear model. This model is called a prediction matrix memory, since it can be used to predict the reward.

The chapter begins with a description of a signal representation called the channel representation. Then, in section 3.2, the concept of the order of a function is described to enable a definition of the class of reward functions that the system can learn, and in section 3.3 the outer product decision space, where the decisions are represented, is described. In section 3.4, an algorithm is presented that enables the prediction matrix memory to learn the reward function. Then, in section 3.5, a dynamic tree structure that uses prediction matrix memories as local adaptive models is described. This tree structure can learn a larger class of reward functions. In section 3.6, some experiments are demonstrated, and finally the algorithm and the results are discussed in section 3.7.

3.1 The channel representation

As has been discussed in section 2.4, the internal representation of information may play a decisive role for the performance of learning systems. The most obvious and probably the most common representation is the one that is used to describe physical entities, e.g. a scalar t for temperature and a three-dimensional vector p = (x y z)ᵀ for a position in space. This is, however, not the only way, and in some cases not even a very good way to represent information. For example, consider an orientation in R² which can be represented by an angle φ relative to a fixed orientation, e.g. the x-axis. While this may appear to be a very natural representation of orientation, it is in fact not a very good one, since it has a discontinuity at π and an orientation average cannot be consistently defined [48].

Another, perhaps more natural, way of representing information is the channel representation [62]. In this representation, a set of channels is used where each channel is sensitive to some specific feature value in the signal, for example a certain temperature t_i or a certain position p_i. In the example above, the orientation in R² could be represented by a set of channels evenly spread out on the unit circle, as proposed by Granlund [22]. If three channels with the form

    c_k = cos²( (3/4)(φ - p_k) )                                       (3.1)

where p_1 = -2π/3, p_2 = 0 and p_3 = 2π/3 are used, as proposed by Knutsson [46], the orientation can be represented continuously by the channel vector c = (c_1 c_2 c_3)ᵀ, which has a constant norm for all orientations. The reason to call this a more natural representation than, for instance, the angle φ is that the channel representation is frequently used in biological systems, where each nerve cell responds strongly to a specific feature value. One example of this is the orientation-sensitive cells in the primary visual cortex described by Hubel and Wiesel [39]. This representation is called value encoding by Ballard [3], who contrasts it with variable encoding, where the activity is monotonically increasing with some parameter.

Theoretically, the channels can be designed so that there is one channel for each feature value that can occur. A function of these feature values would then be implemented simply as a look-up table. In practice, however, the range of values of a feature is often continuous (or at least quantized finely enough to be considered continuous). The coding is then designed so that the channel has its maximum value (e.g. one) when the feature and the channel are exactly tuned to each other, and decreases to zero in a smooth way as the feature and the channel become less similar. This implies that each channel can be seen as the response of a filter that is tuned to some specific feature value. This is similar to the magnitude representation proposed by Granlund [24].

The channel representation increases the number of dimensions in the representation. It should, however, be noted that an increase in the dimensionality does not have to lead to increased complexity. A great advantage of the channel representation is that it allows for simple processing structures. To see this, consider any continuous function y = f(x). If x is represented by a sufficiently large number of channels c_k of a suitable form, the output y can simply be calculated as a weighted sum of the input channels, y = wᵀc, however complicated the function f may be. This implies that by using a channel representation, linear operations can be used to a great extent; this fact is used further in this chapter.

Figure 3.1: A set of cos² channels. Only three channels are activated simultaneously. The squared sum of the channels c_{k-1}, c_k and c_{k+1} is drawn with a dotted line.

It is not obvious how to choose the shape of the channels. Consider, for example, the coding of a variable x into channels. According to the description above, each channel is positive, has its maximum for one specific value of x, and decreases smoothly to zero from this maximum. In addition, to enable representation of all values of x in an interval, there must be overlapping channels on this interval. It is also convenient if the norm of the channel vector is constant. One channel form that fulfils these requirements is:

    c_k = cos²( (π/3)(x - k) )   if |x - k| < 3/2
    c_k = 0                      otherwise                             (3.2)

(see figure 3.1). This set of channels has not only a constant norm (see proof A.1.1 on page 97), which, as mentioned in section 2.4.1, enables the use of the scalar product when calculating similarity between vectors, but also a constant squared sum of its first derivatives (see proof A.1.2 on page 99), as shown by Knutsson [47]. This means that a change Δx in x always gives the same change Δc in c for any x. Of course, not only scalars can be coded into vectors with constant norm. Any vector v in a vector space of n-1 dimensions can be mapped into the orientation of a unit-length vector in an n-dimensional space, as noted by Denoeux and Lengellé [14].
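A small sketch of this cos² channel encoding; the number of channels and their integer positions are arbitrary choices for the illustration:

    import numpy as np

    def channel_encode(x, n_channels=8):
        """Encode a scalar x (roughly in [0, n_channels - 1]) with cos^2 channels
        placed at k = 0, 1, ..., n_channels - 1, as in equation (3.2)."""
        k = np.arange(n_channels)
        return np.where(np.abs(x - k) < 1.5,
                        np.cos(np.pi * (x - k) / 3.0) ** 2,
                        0.0)

    c = channel_encode(3.3)
    print(np.linalg.norm(c))   # constant (sqrt(9/8)) away from the borders of the set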

The channel vectors described above are non-zero only in a small, limited number of dimensions at a time; the channels in all other dimensions are zero. The number of simultaneously active channels is called the local dimensionality. In the example in figure 3.1, the local dimensionality is three. This means that the vector moves along a curve as in figure 3.2 (left) as x changes. If we look at channels far apart, only one of these channels is active at a time (figure 3.2, right); the activity is local. I call this type of channel vector a pure channel vector. The pure channel vector can be seen as an extreme of sparse distributed coding, described by Field [16] as a coding that represents data with a minimum number of active units, in contrast to compact coding, which represents data with a minimum number of units.

Figure 3.2: Left: The curve along which a channel vector can move in a subspace spanned by three neighbouring channels. The broken part of the curve illustrates how the vector proceeds into other dimensions. Right: The possible channel vectors viewed in a subspace spanned by three distant channels or by three non-overlapping channels.

In general, the input to a system cannot be a pure channel vector. Consider, for example, a system that uses visual input, i.e. images. It is obvious that the dimensionality of the space of pure channel vectors that could represent all images would be far too large to be of practical interest. The input should rather consist of many sets of channels, where each set measures a local property in the image, for example local orientation. Each set can be a pure channel vector, but the total input vector, consisting of several concatenated pure channel vectors, will not only have local activity. I call this type of vector, which consists of many sets of channels, a mixed channel vector.

The use of mixed channel vectors is not only motivated by limited processing capacity. Consider, for example, the representation of a two-dimensional variable x = (x_1 x_2)ᵀ. We may represent this variable with a pure channel vector by distributing, on the X-plane, overlapping channels that are sensitive to different x_i, as in figure 3.3 (left). Another way is to represent x with a mixed channel vector by using two sets of channels, as in figure 3.3 (right). Here, each set is only sensitive to one of the two parameters x_1 and x_2 and does not depend on the other parameter at all; the channel vector c_1 on the x_1-axis is said to be invariant with respect to x_2. If x_1 and x_2 represent different properties of x, for instance colour and size, the invariance can be a very useful feature. It makes it possible to observe one property independently of the others by looking at a subset of the channels. Note, however, that this does not mean that all multi-dimensional variables should be represented by mixed channel vectors. If, for example, (x_1 x_2)ᵀ in figure 3.3 represents the two-dimensional position of a physical object, it does not seem useful to see the x_1 and x_2 positions as two different properties. In this case, the pure channel vector (left) might be a proper representation.

Figure 3.3: Left: Representation of a two-dimensional variable with one set of channels that constitute a pure channel vector. Right: Representation of the same variable with two sets of channels that together form a mixed channel vector.

The use of mixed channel vectors offers another advantage compared to using the original variables, namely the simultaneous representation of properties which belong to different objects. Consider a one-dimensional variable x representing the position of an object along a line and compare this with a channel vector c representing the same thing. Now, if two objects occur at different positions, a mixed channel vector allows the positions of both objects to be represented. This is obviously not possible when using the single variable x. Note that the mixed channel vector discussed here differs from the one described previously, which consists of two or more concatenated pure channel vectors. In that case, the mixed channel vector represents several features and one instance of each feature. In the case of representing two or more positions, the mixed channel vector represents several instances of the same feature, i.e. multiple events. Both representations are, however, mixed channel vectors in the sense that they can have simultaneous activity on channels far apart, as opposed to pure channel vectors.

3.2 The order of a function

As mentioned in the previous section, the channel representation makes it possible to realize a rather complicated function as a linear function of the input channels. In section 3.4, we will consider a system that is to learn to produce an output channel vector q as a function of an input channel vector v. The functions considered here are continuous functions of a pure channel vector (see page 46), or functions that are dependent on one property while invariant with respect to the others in a mixed channel vector; in other words, functions that can be realized by letting the output channels be linear combinations of the input channels. I call this type of function a first-order function.

This concept of order is similar to the one defined by Minsky and Papert [58] (see appendix B). In their discussion, the inputs are binary vectors, which of course can be seen as mixed channel vectors with non-overlapping channels. By this definition, a first-order function is a linear threshold function.

In the present work, the channels are not binary but can have any value in an interval, for example v_i ∈ [0, 1], and further there is no threshold, only a linear function. Still, the concept of order is similar in the loose sense that the order is the number of events in the input vector that must be considered simultaneously in order to define the output. In practice, this means that, for instance, a first-order function does not depend on any relation between different events; a second-order function depends on the relation between no more than two events, and so on. Of course, this implies that a function of a pure channel vector must be a first-order function, since the vector only describes one event at a time.

3.3 The outer product decision space

Consider the (x_1, x_2)-plane in figure 3.3 (right). Let a position x be represented by a mixed channel vector v = (v_1ᵀ v_2ᵀ)ᵀ, where v_1 and v_2 are pure channel vectors encoding x_1 and x_2 respectively. A function of the position in that plane is a second-order function of v, since it depends on both x_1 and x_2. Now, consider the outer product of the vector v with itself, i.e. vvᵀ. This product contains all combinations of x_1 and x_2. Such outer products of channel vectors have some interesting invariance properties, investigated by Nordberg [62].

The outer product vvᵀ in this example can be written as

    vvᵀ = ( v_1v_1ᵀ   v_1v_2ᵀ )
          ( v_2v_1ᵀ   v_2v_2ᵀ ).                                       (3.3)

It is obvious that the only relevant part of this product is one of the two parts that combine v_1 and v_2. If v_1 and v_2 belong to the vector spaces V_1 and V_2 respectively, the combinations of these vectors can be represented in the tensor product space V_1 ⊗ V_2. This space contains all outer products v_1v_2ᵀ. It also contains all m × n matrices, where m and n are the dimensionalities of V_1 and V_2 respectively. Since all combinations of v_1 and v_2 (i.e. all positions in the example above) are represented by local activities in v_1v_2ᵀ, a function of the position can be calculated as a first-order function of the outer product v_1v_2ᵀ.

In general, there is no obvious way to divide a mixed channel vector v into two pure channel vectors v_1 and v_2 as in the example above, even if it is known that the problem is of second order. To see this, consider for example a function of the relation between two positions on the x-axis. The split in this case must be made between the two positions, and such a split which is valid in one case may not be valid in another. There is, however, one important case where such a split does not have to be made, since the two channel vectors to be mixed originate from two different sources.

Consider a first-order system which is supplied with an input channel vector v and which generates an output channel vector q. Suppose that v and q are pure channel vectors. If there is a way of defining a scalar r (the reinforcement) for each decision (v, q) (i.e. input-output pair), the function r(v, q) is a second-order function. Hence, this scalar can be calculated as a first-order function of the outer product qvᵀ. The tensor space Q ⊗ V that contains the outer products qvᵀ I call the decision space. In practice the system will, of course, handle a finite number of overlapping channels and r will only be an approximation of the reward. If the reward function is continuous, this approximation can be made arbitrarily good by using a sufficiently large set of channels.

3.4 Learning the reward function

In this section, we will consider a system that is to learn to produce an output channel vector q as a ®rst-order function of an input channel vector v, i.e. a ®rst-order learning

system. If supervised learning is used, this is achieved simply by training a weight vector

w q i i for each output channel . This could be done in the same way as in ordinary feed-

forward one-layer neural networks, i.e. by minimizing some error function, for instance

2

= kq q~ k ; (3.4)  50 PREDICTION MATRIX MEMORY

W

qv T

p

Figure 3.4: The reward prediction p for a certain stimulus-response pair (v, q)viewed

V

as a projection onto W in Q . q where ~ is the correct output channel vector supplied by the teacher. This means, for the

whole system, that a matrix W is trained so that a correct output vector is generated as

= Wv : (3.5) q

In reinforcement learning, however, the correct output is unknown; only a scalar r that is a measure of the performance of the system is known. But the reward is a function of the stimulus and the response, at least if the environment is not completely stochastic. If the system can learn this function, the best response for each stimulus can be found. As

described above (section 3.3), the reward function for a ®rst-order system can be approxi- T mated by a linear combination of the terms in the outer product qv . This approximation

can be used as a prediction p of the reward and is calculated as

T

= hW j qv i; (3.6) p see ®gure 3.4. The matrix W is therefore called a prediction matrix memory. The reward function can be learned by modifying W in the same manner as in supervised learning,

but here with the aim to minimize the error function

2

= jr pj :

(3.7) 

v ; q ;r Now, let each triple  of stimulus, response, and reward denote an experience. Consider a system that has been subject to a number of experiences. How should a proper

response be chosen by the system? The prediction p in equation 3.6 can be rewritten as

T

= q Wv = hq j Wv i: (3.8) p

Due to the channel representation, the actual output is completely determined by the di- rection of the output vector. Hence, we can regard the norm of q as ®xed and try to ®nd 3.4 LEARNING THE REWARD FUNCTION 51

an optimal direction of q.Theq that gives the highest predicted reward obviously has

p r the same direction as Wv .Now,if is a good prediction of the reward for a certain

stimulus v, this choice of q would be the one that gives the highest reward. An obvious

choice of the response q is then

= Wv (3.9) q which is the same ®rst-order function as the one suggested for supervised learning in equation 3.5. Since q is a function of the input v, the prediction can be calculated di-

rectly from the input. Equation 3.8 together with equation 3.9 give the prediction as

2

= hWv j Wv i = kWv k : (3.10) p

Now we have a very simple processing structure (essentially a matrix multiplication) that can generate proper responses and predictions of the associated rewards for any ®rst- order function.

This structure is similar to the learning matrix or correlation matrix memory described by Steinbuch and Piske in 1963 [75] and later by Anderson [1, 2] and by Kohonen [50, 52]. The correlation matrix memory is a kind of linear associative memory which is trained with a generalization of Hebbian learning [30]. An associative memory maps an input vector a to an output vector b, and the correlation matrix memory stores this mapping as

a sum of outer products:

X

T

= ba : (3.11) M

The stored patterns are then retrieved as

= Ma (3.12) b which is equal to equation 3.9. The main difference is that in the method described here, the correlation strength is retrieved and used as a prediction of the reward. Kohonen [50] has investigated the selectivity and tolerance with respect to destroyed connections in the correlation matrix memories.

3.4.1 The Learning Algorithm

; q;r

The training of the matrix W is a very simple algorithm. For a certain experience ( v ), r

the prediction p should, in the optimal case, equal . This means that the aim is to mini- 0

mize the error in equation 3.7. The desired weight matrix W would yield a prediction

0 0 T

= r = hW j qv i: (3.13) p

52 PREDICTION MATRIX MEMORY

0  The direction in which to search for W can be found by calculating the derivative of

with respect to W . From equations 3.6 and 3.7 we get the error

T 2

= jr hW j qv ij (3.14) 

and the derivative is

d

T

2r pqv :

(3.15) =

dW

To minimize the error, the modi®cation of W should be in the opposite direction of the

0

W a

derivative. This means that W can be obtained by changing a certain amount in T

the direction qv ,i.e.

0 T

= W + aqv : (3.16) W

Equation 3.13 now gives that

2 2

kv k = p + akqk (3.17) r

(see proof A.1.3 on page 99) which gives

r p

= :

(3.18) a

2 2

kqk kv k

The usual way to make the learning procedure stable is to let W change only a small step

in the negative gradient direction for each iteration. The update rule therefore becomes

t +1= Wt+ Wt; (3.19) W

where

r p

T

W = qv

(3.20) 

2 2

kqk kv k

0 < 1 and where is a ªlearning speed factorº . If the channel representation is chosen so that the norm of the channel vectors is constant and equal to one, this equation

is simpli®ed to

T

W =rpqv : (3.21) 

Here, the difference between this method and the correlation matrix memory becomes

clearer. The learning rule in equation 3.11 corresponds to that in equation 3.21 with

r p = 1 r = p . The prediction matrix W in equation 3.21 will converge when , 3.4 LEARNING THE REWARD FUNCTION 53 while the correlation matrix M in equation 3.11 would grow for each iteration unless a normalization procedure is used.

Here, we can see how reinforcement learning and supervised learning can be combined,

= p +1 =1

as mentioned at the end of section 2.1. By setting r and we get the update =1 rule for the correlation matrix memory in equation 3.11, and with r we get a corre- lation matrix memory with a converging matrix. This means that if the correct response

is known, it can be learned with supervised learning by forcing the output to the correct

=1 r =1 r = p +1 response and setting the parameters and or . When the correct response is not known, the system is let to produce the response and the reinforcement learning algorithm described above can be used.

3.4.2 Relation to TD-methods

The description above of the learning algorithm assumed a reinforcementsignal as a feed- back to the system for each single decision (i.e. stimulus-response pair). This is, however, not necessary. Since the system produces a prediction of the following reinforcement, this prediction can be used to produce an internal reinforcement signal in a way similar to that in the TD-methods described in section 2.5. This is simply done by using the next prediction as a reinforcement signal when an external reinforcement is lacking. In other

words



r r>0

r = ;

(3.22) ^

pt +1 r =0

r r

where ^ is the internal reinforcement that replaces the external reinforcement in the

t +1

algorithm described above, p is the next predicted reinforcement and is a pre-

<  1 diction decay factor 0 . This factor makes the predicted reinforcement decay as the distance from the actual rewarded state increases. This means that the system can handle dynamic problems with infrequent reinforcement signals.

In fact, this system is better suited for the use of TD-methods than the systems mentioned in section 2.5 since they have to use separate subsystems to calculate the predicted rein- forcement. With the algorithm suggested here, this prediction is calculated by the same system as the response. 54 PREDICTION MATRIX MEMORY

3.5 A tree structure

The system described above have some obvious limitations. One is that if the function to be implemented is very complicated (e.g. discontinuous), a very large number of chan- nels must be used in order to get a good approximation. Another limitation is the demand for a pure channel vector. Even if a mixed channel vector is created as a concatenation of two pure channel vectors, the system can only respond to one part of the mixed vector, a part that corresponds to one of the pure channel vectors. These limitations can be re- duced by the use of local adaptive models. Compare this with the badminton experiment in section 2.6.2 where two linear models are used to represent a nonlinear function. In the case of a channel representation, fewer channels are probably needed in a local model than in a global one. A discontinuity, for example, requires a large number of channels to be suf®ciently well represented. This can be handled by using two models, one acting on each side of the discontinuity, and by using fewer channels than in the global alternative.

One way to arrange the local models is to use a binary tree structure [9] (see ®gure 3.5).

This structure has several advantages. If a large number of subsystems is used, an ob-

log n

vious advantage is that the search time required to ®nd the winner is of order O

O n where n is the number of models, while it is of order when searching in a list of models. Another feature is that the subdivision of the decision space is simple, since only two subsystems compete in each node.

3.5.1 Generating responses

In section 3.4, I described the basic functions (i.e. response generation and calculation of predicted reward) of one local model or node as it is called when used in a tree structure. In this section, I discuss some issues of the response generation, issues that are dependent on or caused by the tree architecture. I will consider a ®xed tree that is already adapted to the solution of the problem. The adaptive growing of the tree is described in section

3.5.2. W

Consider the tree in ®gure 3.5. The root node is denoted by 0 and its two children

W W 02 01 and respectively, and so on. A node without children is called a leaf node.

The weight matrix W used in a node is the sum of all matrices from the root to that node.

012 W = W + W + 01

In ®gure 3.5, node , for example, will use the weight matrix 0 W

012 . Two nodes with the same father are called brothers. Two brothers always have opposite directions since their father is in the ªcentre of massº of the brothers (this is

further explained in section 3.5.2).

W = W + W W = W + W

0 01 B 0 02 Now, consider a tree with the two models A and . 3.5 A TREE STRUCTURE 55

W W 0 W W 012 W 0 02 W 01 W W W 011 01 02

WW 011 012

Figure 3.5: Left: An abstract illustration of a binary tree. Right: A geometric illustra- tion of the weight matrices in a tree structure (representing three linear functions). Split

nodes are marked with a dot ( ).

v q q B

An input vector can generate one of the two output vectors A and as

q = W v =W +W v = W v + W v = q + W v

A 0 01 0 01 0 01 (3.23) A

and

q = W v =W +W v = W v + W v = q + W v

B 0 02 0 02 0 02

(3.24) B

q W W

01 02

respectively, where 0 is the solution of the root node. Since and are of op-

W = W 01 posite directions, we can write 02 ,where is a positive factor, and hence

equation 3.24 can be written as

q = q W v :

0 01 (3.25) B

Adding equations 3.23 and 3.25 gives

q + q 1 W v

A B 01

q = :

(3.26) 0

2

W W B

Now, if A and have equal masses (i.e. the solutions are equally good and equally =1

common), and

q + q

A B

q = ;

(3.27) 0

2

q W A

i.e. 0 will be the average of the two output vectors. If is better or more common

W >1 W W q

0 A 0

(i.e. more likely a priori to appear) than B , (is closer to )and will q approach A . Equation 3.23 also implies that the response can be generated in a hierar- chical manner, starting with the coarse and global solution of the root and then re®ning 56 PREDICTION MATRIX MEMORY the response gradually just by adding to the present output vector the new vectors that are generated when traversing the tree. In this way an approximate solution can be generated very fast since this solution does not have to wait until the best leaf node is found.

The process of traversing the tree in search for the best leaf node can be described as a

recursive procedure:

W = W

1. Start with the root node: 0 .

W = W

2. Let the current matrix be the father: f .

[W ; W ]

1 f 2

3. Select one of the children f as winner.

W = W + W f ;w inner 4. Add the winner to the father: f .

5. If W is not a leaf node go to 2.

= Wv 6. Generate a response: q .

This scheme does not describe the hierarchical response generation mentioned above. This is, however, easily implemented by moving the response generating step into the loop (e.g. between step 2 and 3 in the scheme above) and adding each new output vector to the previous one.

Now, how should the winner be selected? Well, the most obvious way is always to select

the child with the highest prediction p. This could, however, cause some problems. For

reasons explained in section 3.5.2, the winner is selected in a probabilistic way. Consider

p p P

2 1 two children with predictions 1 and respectively. The probability of selecting

child 1 as winner is chosen to be

1 3 2 0

C 7 6 B

1 p p

1 2

C 7 6 B

r

P = +1 ; erf

(3.28) 1

A 5 4 @

2

2

2





hig h

low

+

n n

hig h low



where erf  is the error function, i.e. the integral of a Gaussian, illustrated in ®gure 3.6.

  low

hig h and are the variances of the rewards for the choice of the child with the high-

n n low est prediction and the lowest prediction respectively. hig h and are the sample sizes of these estimates. When the two predictions are equal, the probability for each child to win is 0.5 (see ®gure 3.6). When the variances decrease and the sample sizes increase, the signi®cance of the hypothesis that it is better to choose the node with the highest predic- tion as winner than to choose at random increases. This leads to a sharpening of the error function and a decrease in the probability of selecting the 'wrong' child. This function is 3.5 A TREE STRUCTURE 57

P 1 1

p12 − p

 n i Figure 3.6: Illustration of equation 3.28 for two different cases of i and . The solid line shows a case in which the variation is smaller and/or the number of data is larger than in the case shown by the dashed line. similar to the paired t-test of hypothesis on the means of two normal distributions. Here we have used the normal distribution instead of the t-distribution1 in order to be able to use the error function.

3.5.2 Growing the tree

Since the appropriate size of the tree cannot be known in advance in general, we have to start with a single node and grow the tree until a satisfying solution is obtained. Hence, we start the learning by trying to solve the problem by using a single linear function. The single node will now try to approximate the optimal function as well as possible. If this solution is not good enough (i.e. a low average reward is obtained), the node will be split. To ®nd out when the node has converged to a solution (optimal or not optimal), a measure

of consistency of the changes of W is calculated as

P

k

k W k

k

k

P

= ;

(3.29) c

k

k W k

k

k

:0<  1g where f is a decay factor that makes the measure more sensitive to recent changes than to older ones. These sums are accumulated continuously during the learn- ing process. Note that the consistency measure is normalized so that it does not depend

upon the step size in the learning algorithm. If c goes below a certain limit, convergence is assumed.

Now, if a node is split, two new nodes are created, each with a new weight matrix. The probability of selecting a node depends on the predicted reward (see equation 3.28). Con- sider the case where the problem can be solved by the use of two linear models. The

1 The effect of this simpli®cation is probably not important since both distributions are unimodal and sym- metric with their maximum at the same point. The exact shape of the distribution is of less importance. 58 PREDICTION MATRIX MEMORY

∆W f ∆W

m m 2 1 W 2 W1

Wf

Figure 3.7: A geometric illustration of the update vectors of a father node and its two child nodes. solution developed by the ®rst node (the root) converges to an interpolation of the two linear models. This solution is in between the two optimal solutions in the decision space

( V Q). If the node is split into two child nodes, they will ®nd their solutions on each side of the father's solution. If the child nodes start in the same position in the decision space as their father (i.e. they are initialized with zeros), the competition is very simple. As soon as one of the children is closer to one solution, the other child will be closer than its brother to the other solution.

The mass m of a node is de®ned as the reward obtained by that node averaged over all decisions made by that node or its brother. This means that a node that often loses in the competition with its brother will get a low mass (even if it receives high rewards when it does win). The mass will not be changed if the decision is made in another branch of the tree and not by the node itself or its brother. The mass of a node indicates that node's

share of the joint success the node itself and its brother have attained. This means that

m m l

the masses w and of a winner node and a loser node respectively are updated as

t +1= mt+ mt; (3.30) m

where

m =rm  w (3.31) w

and

m = m ; l (3.32) l

where is a memory decay factor. When a node's matrix and mass are updated, the father will be updated too so that it remains in the centre of mass of its two children. In this way, 3.5 A TREE STRUCTURE 59

the father will contain a solution that is a coarse and global approximation of the solutions W

of its two children. Consider ®gure 3.7 and suppose that 1 is the winner child and that W

2 is the loser. One experience gives:

W  : The total change of the vectors according to the update rule in equation 3.21

in section 3.4.

 m

1 : The change in the mass of the winner node.

 m

2 : The change in the mass of the loser node.

We have three prerequisites:

 W 

f should remain in the centre of mass of its children

m W + W +m W +W  m W +m W

1 f 1 2 f 2 1 1 2 2

W = = W + 

f f

m +m m +m

1 2 1 2

m W +m W =0:

1 2 2

(3.33) 1 

 The loser should not be moved

W +W =0: 2

(3.34) f

W 

 The total change of the winner node should be

W +W =W: 1

(3.35) f W These prerequisites imply that when  has been calculated, the father should be al-

tered according to

0

W m + W m + W m

1 1 2 2

1

W =

(3.36) f

0 0

m + m

1 2

0 0

m m = m +m W

(see proof A.1.4 on page 100), where is the new mass, i.e. .If f is

W

not the root node, the same calculation will be made at the next level with f as the W new  . Hence, the changes in the weight matrices will be propagated up the tree.

If a node is accidentally split and that node's children converge to the same solution, the node itself will also contain this solution. If this is the case, the two redundant children

can be pruned away. To detect this situation, a signi®cance measure s of the use of the two 60 PREDICTION MATRIX MEMORY children is calculated. Since the system sometimes chooses the child with the higher and sometimes the child with the lower prediction, the distributions of the rewards for these

two cases can be estimated. From these statistics the signi®cance can be calculated as

2 0 1 3

6 B C 7

1 1  

hig h low

6 B C 7

r

= erf +1 ;

(3.37) s

4 @ A 5

2

2

2





hig h

low

+

n n

hig h low

  low

where hig h and are the estimated mean rewards for the choice of the child with n

high and low prediction respectively,  and are the corresponding variances and sample s

sizes of each child and  is a threshold. Equation 3.37 gives a signi®cance measure of

   low the hypothesis that hig h is better than (relatively). If the signi®cance is low, i.e. if it does not matter which child is chosen, the solutions are obviously very similar and the child nodes can be pruned away.

There is another reason for selecting the winner in a probabilistic manner. Just after a node has been split, there is a risk of one child learning a prediction function that is larger than the brother's function for all inputs. In particular, this is a risk if the father is split too early so that the father and one child leave the other child behind. If a deterministic selection of the winner was done, the child that is left behind would never have a chance of ®nding a solution and this would be a dead end. The brother would then converge to the same solution as the father. This would not necessary affect the ®nal solution, but it would give a redundant level in the tree which costs memory space and computing time. By using a probabilistic choice, this problem is avoided.

Using the tree is computationally very simple. The response generation is of the order

vq log n n v

O ,where is the number of nodes, is the dimension of the input vector and

vq v  q

q is the dimension of the output vector. is the multiplication of an matrix with an

log n n

v -dimensional vector and is the number of multiplications in a tree with nodes.

2

n v  q The learning process takes another log additions of matrices.

3.6 Experiments

In this section, some experiments are presented to illustrate the principles of the methods described in this chapter. In the ®rst subsection, the use of the channel representation is illustrated. In the next subsection, an illustration of the use of a tree structure when solving the same problem as in the ®rst example is given. Finally, the tree structure is used to solve an XOR-problem which is a second-order problem. 3.6 EXPERIMENTS 61

3.6.1 The performance when using different numbers of channels

In the ®rst three experiments, the use of different numbers of channels in the represen- tation of the input variable is demonstrated. I have used three different representations

(see ®gure 3.8). The ®rst one uses two channels to represent an input x in the interval

 x  4

0 :





v cos x

1

8



v = = :

(3.38) 

v cos x 4

2

8 2

The second representation uses ®ve channels with a cos shape:

 

  

2

cos x x  jx x j < i

i if

3 3 2

v = ;

(3.39) i

0 otherwise

2

0  x  4 x = i; i 2f0:::4g cos where and i . The last one uses nine channels with a

shape:

 

2 2 

2

cos x x  jx x j < i

i if

3 3 2

v = ;

(3.40) i

0 otherwise

i

;i 2f0:::8g 0  x  4 x =

where and i . Note that the norms of the channel vectors

2

2 cos

are constant for all x except for small intervals at the upper and lower parts in the

< 0:5

representations, intervals where less than three channels are used, i.e. for x and

3:5 x<0:25 x>3:75 x> in the case of ®ve channels and for and in the case of nine

channels. The output y is represented by two channels in all three cases:





q cos y

1

8



q = = ; 0  y  4:

(3.41) 

q

cos y 4

2 8

The system described in section 3.4 was trained in a fairly simple but nonlinear problem.

The system used the tree representations of the input variable x that are described above.

Note that no tree structure was used here; only one single linear system was used. The

= f x goal for the system was to learn a nonlinear function y that contained a discon-

tinuity:



2 0  x  2 =

(3.42) y

1 2

(see the top left plot in ®gure 3.9). Random x-values were generated from a uniform

 x  4 r

distribution in the interval 0 and a reward signal was calculated as

" 

2

y y

cor r ect

= max 0; 1 ;

(3.43) r 2 62 PREDICTION MATRIX MEMORY

1

0.5

0 0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

1

0.5

0 0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

1

0.5

0 0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Figure 3.8: The input scalar represented by two channels (top), ®ve channels (middle) and nine channels (bottom). For clarity, every second channel is drawn with a broken

line.

y y where is the output from the system decoded into a scalar and cor r ect is the desired output according to equation 3.42. The plots in ®gure 3.9 illustrate the predictions made by the system for all combinations of input and output signals as iso-curves according to equation 3.6 in section 3.4. The maximum prediction for each input is indicated with a

dotted line and this is the response that the system would choose. Note that the channel

= Wv

representation enables the linear system q to implement a nonlinear function

= f x y , and the more channels that are used, the more accurate is the approximation of the optimal function as well as the approximation of the reward function.

3.6.2 The use of a tree structure

In the following experiment, the tree structure described in section 3.5 was used. Each node consists of a linear system with two input channels and two output channels like the ®rst system described in the previous subsection.

The system was trained in the same problem as the one in the previous subsection (equa- tion 3.42). The input variable was generated in the same way as above and the reward 3.6 EXPERIMENTS 63

4.0 4.0 3.5 3.5 3.0 3.0 2.5 2.5 2.0 2.0 1.5 1.5 1.0 1.0 0.5 0.5

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

4.0 4.0 3.5 3.5 3.0 3.0 2.5 2.5 2.0 2.0 1.5 1.5 1.0 1.0 0.5 0.5

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Figure 3.9: The top left ®gure displays the optimal function. The other three ®gures illustrate the predictions as iso-curves with the maximum prediction marked with a dotted line for two (top right), ®ve (bottom left) and nine (bottom right) channels.

function was the same as in equation 3.43. Each iteration consists of one input, one out- put and a reward signal. The result is plotted in ®gure 3.10. To the left, the output of the system is drawn. To the right, the output of the root node is drawn to illustrate that the root node represents an interpolation of the two solutions. It is obvious that this prob- lem can be solved by a two-level tree, and that is what the algorithm converged to. The discontinuity explains the fact that this solution with only two channels is better than the solutions with more channels but with only one level in the previous experiments.

3.6.3 The XOR-problem

The aim of the following experiment is to demonstrate the use of the tree structure in a problem that is impossible to solve, even approximately, on one single level, no mat- 64 PREDICTION MATRIX MEMORY

4.0 4.0 3.5 3.5 3.0 3.0 2.5 2.5 2.0 2.0 1.5 1.5 1.0 1.0 0.5 0.5

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Figure 3.10: The result when using a tree structure. The predictions are illustrated by iso- curves with the maximum predictions marked with dotted lines. The optimal functions are drawn as solid lines. Left: The output of the whole system. Right: The output of the root node.

3 y 3 y

2 2

1 1

0 0

−1 −1 1 1 0.8 1 0.8 1 0.6 0.8 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 x2 0 0 x1 x2 0 0 x1

Figure 3.11: Left: The correct XOR-function. Right: The result after 4000 trials.

ter how many channels are used. The problem is to implement a function similar to the

x x 2 Boolean XOR-function. The input consists of two parameters, 1 and , each of which is represented by nine channels. The two channel vectors are concatenated to form an input channel vector v with 18 channels.

The function the system is to learn is

(3.44)



1 x > 0:5 x  0:5 x  0:5 x > 0:5

2 1 2

if ( 1 AND )OR( AND )

y =

0 otherwise

which is illustrated to the left in ®gure 3.11. The reason for this problem being impossible x to solve by using one linear system is that for each value of 1 there are two possible cor-

3.7 CONCLUSIONS 65 x rect outputs depending on the value of 2 . Hence, no single channel in the input vector contains any useful information alone. The system must use the relation between differ- ent channels in order to make the correct response and this is not possible with one single linear system.

The system was trained by using the same reward function as in equation 3.43. The result after 4000 iterations is shown in ®gure 3.11 (right). The tree had then converged to three levels with four leaf nodes, one for each quadrant in the input space as illustrated in ®gure 3.12.

W 0

W W 01 02

WWWW 011 012 021 022

Figure 3.12: The three-level tree that evolved during the training in the XOR-problem.

3.7 Conclusions

In this chapter, a system that is able to learn an approximation of the reward function when given a pure channel vector representation of the input has been presented. This approximation can be used both as a prediction of the reward and for ®nding a proper response to a given stimulus. Such systems can be used as local adaptive models in order to solve more complicated problems. A dynamic tree structure for arranging the local models in an ef®cient way has also been presented.

There are, however, some problems left to be solved. The system can only use ®rst or- der functions, at least locally. In the experiment with the XOR-task, the system divided the function into four solutions, each with a constant output. An alternative solution for the system would be to divide the problem into two step functions approximated by a suf®ciently large number of channels. Each step function is a ®rst order function. If,

66 PREDICTION MATRIX MEMORY

y = x x x x

1 1 2 however, the system is to learn a simple function like 2 ,where and are represented by two pure channel vectors concatenated to a mixed channel vector as in the XOR-experiment, the problem cannot be divided into ®rst-order functions. The only way for this system to solve such a task is to divide the problem into a very large number of regions where a constant can be used as a model. This is not a very ef®cient or good

solution of such a simple problem. The problem could, in this case, be solved either by

x ; x  2

letting the whole 1 -plane be represented by different channels as mentioned be-

x x 1 fore, or by letting 2 be represented by a channel coding. In both cases, there is a

pure channel vector and hence only ®rst order functions are needed. The advantage of

x x 1

the second representation is that it is invariant to everything except 2 . This enables

3

the system to generalize; if it has learned the correct answer to 5 it can also give the

6 right answer to 8 .

Another question is the number of channels to use. We saw in the experiments that a small number of channels gives a poor approximation of a complicated function (e.g. a discontinuity), and that this can be compensated for by the use of local models. The rep- resentation of the reward function is, however, also dependent on the number of channels, and with a poor approximation of the reward function the estimation of which model to use will also be poor. Since the required number of channels depends on the number of local models, one solution would be to equip each model with its own set of channels. This would also enable the different models to use different features, both in the rein- forcement prediction and in the response generation.

The central problem seems to be to ®nd the proper features to be represented by pure channel vectors. This problem can of course not be given a general solution but must depend on the task to be solved. In the next chapter, we will look at a method of ®nding features that are important for the system's ability to generate proper responses.

4

CANONICAL CORRELATION

In section 2.6, it was argued that if a system consists of a large enough set of local adap- tive models, a rather complicated function can be represented by very simple models. In chapter 3, such a model was suggested. This model requires the (local) input to be represented by a pure channel vector. One key question is how to generate pure channel vectors, i.e. how to select what the channel vectors should represent. This question is important in any system that uses local adaptive models. Even without the channel rep- resentation, it is important to have an ef®cient mechanism that selects the features to be used by the local models in the input and output spaces. This problem inspired the work which resulted in a learning algorithm which ®nds that linear combination of one set of multi-dimensional variates that is the best predictor and at the same time ®nds that linear combination of another set that is the most predictable, a relation known as canonical correlation.

The chapter begins with an introduction that motivates the choice of canonical correla- tion. In section 4.2, the generalized eigenproblem is presented and the concept of canon- ical correlation is described. A learning algorithm for ®nding the canonical correlations is presented in section 4.3. In section 4.4, a description of how the system can be used for generating responses is given and in section 4.5 this method is compared to a least mean square error method. In section 4.6, the performance of the algorithm is illustrated in experiments and the chapter ends with some conclusions.

4.1 Introduction

There are many ways to perform dimensionality reduction. One way that was mentioned in section 1.4.2 is to use a SOFM [51] (see page 11). Obviously the SOFM generates a

67 68 CANONICAL CORRELATION pure channel vector and therefore this could seem to be a natural choice. A problem, how- ever, with the SOFM in this case is that it has rather strange invariance properties. In ®g- ure 4.1, the invarianceproperties of a one-dimensionalSOFM mapping a two-dimensional space are indicated. For inputs within these regions the output of the SOFM is constant.

Figure 4.1: A one-dimensional SOFM mapping a two-dimensional space. The regions of constant output are indicted with broken lines.

Another problem with the SOFM is that it represents the signals according to how com- mon they are regardless of how they are to be used in the system's decisions. The SOFM primarily represents signals with high variances. This is, however, of little use in a re- sponse generating system. Such a system should ®nd directions that ef®ciently represent signals that are important rather than signals that have large energy. If the input to a sys- tem comes from a set of different sensors, it is not certain that the signal ranges are of the same order of magnitude. Obviously, the range of the signal values from a sensor has very little to do with the importance of the information coming from that sensor. The same reasoning can be followed for the output, which may consist of signals to a set of different effectors.

Another common method of dimensionality reduction is principal component analysis (PCA) [21, 42, 67]. The PCA ®nds the largest eigenvectors of the signal, and the pro- jections onto these eigenvectors could be used as a new representation of the signal. The invariance properties of such a representation are usually more appropriate than those of the SOFM. The projection onto a vector is invariant to all variations orthogonal to the line and, hence, the regions of invariance are hyper-planes orthogonal to the vector. The PCA, however, suffers from the same problem as the SOFM in that it represents signals with high variance rather than signals that are related to the generation of responses. For this reason, the correlation between input and output signals is interesting since this mea- sure is independent of the variances of the signals.

4.2 THE GENERALIZED EIGENPROBLEM 69

2 X

To be able to describe a relation between an input vector x and an output vector

y 2 Y w w y by the use of two linear models x and , we wish to ®nd the models that maximize the correlation between the projections of the input and output vectors onto

the two linear models respectively. This is the same thing as ®nding the largest canonical Y correlation between the sets X and [38, 45, 49].

4.2 The generalized eigenproblem

As a mathematical background to the theory of canonical correlation, I will describe the

important generalized or two-matrix eigenproblem:

w = B^w: (4.1) A^

This problem appears in many scienti®c and engineering problems, see for instance Bock [7], Golub and Van Loan [21] and Stewart [76]. An iterative solution of the prob- lem was already suggested in 1951 by Hestenes and Karush [34]. Special cases of this problem turn out to correspond to:

 Principal component analysis (PCA)

 Singular value decomposition (SVD)

 Canonical correlations

4.2.1 The Rayleigh quotient

The generalized eigenproblemis related to the problem of maximizing a ratio of quadratic

forms, i.e. maximizing

T

w Aw

r =

(4.2) ;

T

w Bw where A is symmetric and B is symmetric and positive de®nite, i.e. a metric matrix. This

ratio is known as the Rayleigh quotient. The gradient of r is

@r

^ ^

Aw rBw;

(4.3) =

@w 0

where is a positive factor (see proof A.2.1 on page 100). Setting the gradient to gives

w = rB^w; (4.4) A^ 70 CANONICAL CORRELATION recognized from equation 4.1 as the generalized eigenproblem. This shows that the ex- tremal points of the Rayleigh quotient (i.e. the points that have zero derivatives) corre- spond to the eigensystem of the generalized eigenproblem. An example of the Rayleigh

quotient is plotted in ®gure 4.2 together with the two eigenvectors of the correspond-

 

1 0 2 1

A = B =

1=4 1 1

ing eigenproblem. The matrices in this example are 0 and .The

w^ w^ w^ r r w^ quotient is plotted as r , i.e. for each direction , is plotted at . The two eigen-

vectors are drawn with the lengths of their corresponding eigenvalues. The ®gure shows

B r that for this choice of A and , is bounded by the largest and the smallest eigenvalue, and that there is no local maximum except for the largest eigenvalue. We will soon see

that these are general properties of the Rayleigh quotient. Also, note that the eigenvec-

1 A tors are not orthogonal. This is due to the fact that the matrix B is not symmetric.

1.5

1

0.5

0

−0.5

−1

−1.5 −1.5 −1 −0.5 0 0.5 1 1.5

Figure 4.2: An example of the Rayleigh quotient plotted together with the eigenvectors

of the corresponding generalized eigenproblem.

r r 6= r i 6= j

i j If the eigenvalues i are distinct (i.e. for ), the different eigenvectors are

orthogonal in the metrics A and B, which means that



i 6= j

0 for

T

w^ B^w =

(4.5) j

i

> 0 i = j

i for

and



i 6= j

0 for

T

w^ A^w =

(4.6) j

i

r i = j i

i for w

(see proof A.2.2 on page 100). This means that the i 's are linearly independent (see n

proof A.2.3 on page 101). Since an n-dimensional space gives eigenvectors which are

fw ;:::;w g n linearly independent, 1 is a base and any w can be expressed as a linear 4.2 THE GENERALIZED EIGENPROBLEM 71 combination of the eigenvectors. Now, we can prove (see proof A.2.4 on page 101) that

the function r is bounded by the largest and the smallest eigenvalues, i.e.

r  r  r ; 1

(4.7) n r which means that there exists a global maximum and that this maximum is 1 . To inves- tigate if there are any local maxima, we look at the second derivative, or the Hessian H,

of r for the solutions of the eigenproblem:

2

@ r 2

H = = A r B i

(4.8) i

T

2

@ w

w Bw

i

w =w^ i i

(see proof A.2.5 on page 102). We can show (see proof A.2.6 on page 102) that the Hes-

H i>1

sian i has positive eigenvalues for , i.e. there exist vectors w such that

T

w H w > 0 8 i>1:

(4.9) i

This means that for all solutions of the eigenproblem except for the largest root, there

exists a direction in which r increases. In other words, there are no local maxima of the

function r except the global one. The important conclusion from this fact is that a gra- dient search method can be used to ®nd the maximum of the Rayleigh quotient since all other points are unstable, and, hence, such a method will ®nd the largest eigenvalue and the corresponding eigenvector of the generalized eigenproblem.

4.2.2 Finding successive solutions

We have seen that a gradient search on the Rayleigh quotient can ®nd the largest eigen-

1

= B A value and the corresponding eigenvector of the matrix M . We have also seen that all solutions but the one corresponding to the largest eigenvalue are unstable. The question of how to ®nd the successive eigenvalues and the corresponding eigenvectors

naturally arises.

n  n f g fu^ g i

Let M be an matrix, i the eigenvalues and the corresponding eigenvectors

fv g fu^ g i

of M.Let i be the dual vectors of , which means that

T

u^ v =  :

j ij

i

fu^ g fv g fv g

i i

i are also called the left eigenvectors of M,and and are said to be biorthog-

u^

onal.If 1 is the eigenvector corresponding to the largest eigenvalue, the new matrix



T

N = M  u^ v 1

(4.10) 1 1

will have the same eigenvectors and eigenvalues as M except for the eigenvalue corre-

u^  1 sponding to 1 (i.e. for M) which now will be zero (see proof A.2.7 on page 103). 72 CANONICAL CORRELATION

This means that the eigenvectors corresponding to the largest eigenvalue of N are the same as those corresponding to the second largest eigenvalue of M.IfMis symmetric, the eigenvectors are orthogonal to and identical with their dual vectors. This process can be repeated in order to ®nd the successive solutions.

4.2.3 Statistical properties of projections

Consider the projections

T T

= w^ x y = w^ y ;

(4.11) x and

x y

2 X y 2 Y X Y where x and . and are two sets of vectors which are not necessarily

of the same dimensionalities. To simplify the notation, I assume that the vectors x and

fxg = E fy g = 0 y have zero mean, i.e. E . Now we will look at the three statistical measures variance, covariance and correlation of these projections and see that they are related to the generalized eigenproblem for certain choices of the matrices A and B.

Variance

The variance of x is

T

w C w

xx x

x

2 T T T

 = efx g = E fw^ xx w^ g = w^ ; C w^ =

x x

(4.12) xx

x x

T

w Iw

x

x

T

C = E fxx g where xx is the within-set covariance matrix of the vector x and I is the

identity matrix. The rightmost term is recognized as the Rayleigh quotient in equation

A = C B = I w x

4.2 where xx and . By ®nding the vector that maximizes the Rayleigh w quotient with this choice of A and B, the variance of the projection onto x is also max-

imized. This corresponds to a PCA of the matrix

X

T

C =  w^ w^ :

i xi

(4.13) xx

xi i

Covariance y

The covariance between x and is

T

w C w

xy y

x

T T T

q

C w^ = xy w^ g = w^  = E fxy g = E fw^ ;

xy y

(4.14) y

x x

T T

w w w w

x y

x y

4.2 THE GENERALIZED EIGENPROBLEM 73

T T T T

C = E fxy g =[Efyx g] = C

where xy is the between-sets covariance matrix.

yx

 w w y

The partial derivatives of with respect to x and are

@ 1

C w^ w^  =

y x

(4.15) xy

@w kw k

x x

and

@ 1

C w^ w^  =

x y

(4.16) yx

@w kw k

y y respectively (see proof A.2.8 on page 103). Setting the derivatives to zero gives the equa-

tion

C w = w ;

(4.17) A

where

0 C

xy

C =

(4.18) A

C 0 yx

and

w^

x

= :

(4.19) w

w^

y

A = C B = I

Here we have the generalized eigenproblem where A and (equation4.1).

w w y

In this case, by ®nding the pair of vectors x and that maximizes the Rayleigh quo-

w w y tient, the covariance between the projections onto x and is also maximized. This

corresponds to an SVD [21, 54] of the matrix

X

T

C = :  w^ w^

xy xi

(4.20) i

yi i 74 CANONICAL CORRELATION

Correlation

1 y

The correlation between x and is

E fxy g

p =

(4.21) 

2 2

E fx gE fy g

T T

E fw^ xy w^ g

y

x

q

=

T T T T

Efw^ xx w^ gEfw^ yy w^ g

x y

x y

T

w^ C w^

xy y

x

q

= :

T T

w^ C w^ w^ C w^

xx x yy y

x y

 w^ w^ y

The partial derivatives of with respect to x and are

T

@ w^ C w^

xy y

x

= a C w^ C w^

y xx x

(4.22) xy

T

@w^ w^ C w^

x xx x x

and

 !

T

w^ C w^

@

yx x

y

= a C w^ C w^ ;

x yy y

(4.23) yx

T

@w^ w^ C w^

y yy y y

where a is a positive scalar (see proof A.2.9 on page 104). Setting the derivatives to zero

gives the equation

C w = C w ; B

(4.24) A

C C B

where A is given by equation 4.18, is given by

C 0

xx

C =

(4.25) B

0 C yy

and

 w^

x x

= ;

(4.26) w

 w^

y y

where

s

T

w^ C w^



yy y

x

y :

(4.27) =

T

 w^ C w^

y xx x x

1 The term correlation is sometimes inappropriately used for denoting the second-order origin moment

2 2

x [x x ]

( ) as opposed to variance which is the second-order central moment ( 0 ). The de®nition used here can be found in textbooks in mathematical statistics. It can loosely be described as the covariance between two variables normalized by the geometric mean of the variables' variances.

4.2 THE GENERALIZED EIGENPROBLEM 75



x y

is the ratio between the standard deviation of x and the standard deviation of . It can

 y

be interpreted as a scaling factor between the linear combinations.

A = C B = C B

Equation 4.24 is equivalentto the generalized eigenproblemwhere A and .

 w^

x1

The eigenvectors which correspond to the largest eigenvalue 1 are the vectors and

T

w^ x = w^ x y =

1 1 1

y that maximize the correlation between the canonical variates and

x1

T

w^ y w w y

. In this case, by ®nding the pair of vectors x and that maximizes the Rayleigh

y 1

quotient

T

w C w

A

= ;

(4.28) r

T

w C w

B

w  w

where is given by equation 4.26, the correlation between the projections onto x w

and y is also maximized. This corresponds to ®nding the largest canonical correlation

Y r w^

between the sets X and . We can see that the gradient of is zero for the same 's as

r =  the gradient of . It can also be shown (see proof A.2.10 on page 104) that when

the gradient is zero. This means that a gradient search on  gives the same solution as

a gradient search on r and that they have the same global maximum. It should be noted

N fw^ ; w^ ; g;1  n  N N

yn n that there are only solutions xn where is the minimum of the dimensionalities of x and y. The term and the concept of canonical correlation was introduced by Hotelling in 1936 in an investigation of relations between two sets of

variates [38]. Kay [45] has shown that ®nding the canonical correlations is equivalent

Y x y to maximizing the mutual information between the sets X and if and come from elliptical symmetric random distributions.

An important property of canonical correlations is that they are invariant with respect to

af®ne transformations of x and y. An af®ne transformation is given by a translation of y the origin followed by a linear transformation. The translation of the origin of x or has

no effect on  since it leaves the covariance matrix C unaffected. Invariance with respect to scalings of x and y follows directly from equation 4.21. For invariance with respect to other linear transformations, see proof A.2.11 on page 105. To see why this is a desirable property of a response generating learning system, consider a system equipped with a set of sensors. Suppose this system can learn a speci®c behaviour given the input from these sensors. Now, if the sensors are rearranged so that the set of sensors represents the same part of the environment as before but in another coordinate system, we would still like the system to learn the same behaviour as before. Or if one of the sensors is replaced by a new one with higher sensitivity, there is no reason why the system should learn another

behaviour.

w w y The vectors x and together with the projections of an input and an output signal are illustrated in ®gure 4.3. 76 CANONICAL CORRELATION

X Y

x

y wx x

y

wy

w w X Y y

Figure 4.3: An illustration of the vectors x and in the vector spaces and re-

y x y spectively. The projections x and of the signals and respectively are indicated by broken lines.

4.3 Learning canonical correlation

We have seen that a gradient search can be used to ®nd the solutions of the generalized eigenproblem. An iterative gradient search method for ®nding the canonical correlation

is



w t +1= w t+w t

x x x

(4.29) ;

w t+1= w t+w t

y y y w

where the  's are calculated as



T T

a w = x y w^ x w

x y x

(4.30 ) x



T T

b w = y x w^ y w :

y x y

(4.30 ) y

y

x and are positive update factors. To see that this is a gradient search, we look at w

the expected change  of the vectors. For equation 4.30a we have

T T

E fw g = E fxy w^ xx w g

x y x

(4.31) x

=  C w^ C w 

x xy y xx x

=  C w^ kw kC w^  ;

x xy y x xx x fg

where E means expectation. Identifying this with equation 4.22 gives

@

x

E fw g = ; ;a  0 x

(4.32) x

a @w^ x 4.3 LEARNING CANONICAL CORRELATION 77

with

T

w^ C w^ 

xy y x

x

kw k = =  :

x

T

w^ C w^ 

xx x y x

In the same way, equations 4.30b and 4.23 give

@

y

E fw g = ; ;a  0 y

(4.33) y

a @w^ y

with

T

w^ C w^



yx x

y

y

kw k = =  :

y

T

w^ C w^ 

yy y x

y

w w y This shows that the expected changes of the vectors x and are in the same direc-

tions as the gradient of , which means that the learning rules in equation 4.30 on average

is a gradient search on . It also shows that the stable point of the algorithm, i.e. where

w =w =0 y x , corresponds to equation system 4.24 which is a solution to the prob-

lem of ®nding the canonical correlations. The canonical correlation  is found as

q

 = kw kkw k: y

(4.34) x

 x

Since  is the correlation between the variates and is the scaling factor (see equation



y

w y x

4.27), the norm of the vector x is the least square error estimator of given , and vice

kw k y~ y versa for y . This means that the best estimate of the variate in a least square error

sense is

T

y~ = x w x

and, for symmetry reasons,

T

x~ = y w : y

With these notations we can write equations 4.30 as



w = x y y~ x

(4.35) x

w = y x x~

y y

y y x~ which shows that the algorithm has a stable point where ~ on average equals and on

average equals x.

4.3.1 Learning the successive canonical correlations

In section 4.2.2, it was shown how to ®nd the successive solutions of an eigenproblem.

To be able to see how to use that method here, I rewrite equation 4.24 as

2

M w^ =  w^

x x

(4.36) x

2

M w^ =  w^ ;

y y (4.37) y 78 CANONICAL CORRELATION

where

1 1 1 1

M = C C C C M = C C C C :

xy yx y yx xy

(4.38) x and

xx yy yy xx

M M y

Unfortunately, x and are not in general symmetric. We must therefore ®nd the

w^ w^ v v

yi xj yj

dual vectors to xi and , i.e. the vectors and such that

T T

w^ v =  w^ v =  :

ij yj ij

(4.39) xj and

xi yi

C C yy The different roots of equations 4.36 and 4.37 are orthogonal in the metrics xx and

respectively, which means that

T T

w^ C w^ =0 w^ C w^ =0 i 6= j

xj yy yj

(4.40) xx and for

xi yi

(see proof A.2.12 on page 106). This means that the variates are uncorrelated, since



T T T

E fx x g = E fw^ xx w^ g = w^ C w^ =0

i j xj xx xj

xi xi

6= j:

(4.41) for i

T T T

Efy y g= Efw^ yy w^ g = w^ C w^ =0

i j yj yy yj

yi yi

C w^ C w^ w^ w^

xi yy yi xj yj

Equation 4.40 shows that xx and are orthogonal to and respec-

6= j

tively ( i ). A suitable normalization gives the dual vectors as

C w^

xx xi

v =

(4.42) xi

T

w^ C w^

xx xi xi

and

C w^

yy yi

v = :

(4.43) yi

T

w^ C w^

yy yi yi

The new eigenvector equations are then



2 T 2

M  w^ v w^ =  w^

x x1 x

(4.44) x

1 x1

and



2 T 2

M  w^ v w^ =  w^

y y 1 y

(4.45) y

1 y 1

which have the same solutions, except for the ®rst ones, as equations 4.36 and 4.37 re-

v v yi

spectively. To get xi and , we use an iterative method of estimating the vectors

0 0

v = C w^ v = C w^

xx xi yi yy yi

xi and . The update rule is



0 0 0

v t +1 = v t+v

xi xi

(4.46) xi

0 0 0

v t +1 = v t+v ;

yi yi yi 4.4 GENERATING RESPONSES 79

where



T 0 0

= xx w^ v v

xi xi (4.47) xi

and



T 0 0

: = yy w^ v v

yi yi (4.48) yi

We then get

0 0

v v

xi yi

v = v = : yi

(4.49) xi and

T T

0 0

w^ v w^ v

xi yi

xi yi

To ®nd the second pair eigenvalues and the corresponding eigenvectors of equations 4.36

and 4.37, we use the modi®ed learning rule

 

T T

a w = x y y  w^ x w

x y x

(4.50 ) x

1

 

T T

b w = y x x  w^ y w ;

y 1 x y (4.50 ) y

where

T

y = y w^ v

1 y 1 y 1

and

T

x = x w^ v :

1 x1 x1 It can be shown (see proof A.2.13) that at the equilibrium of the modi®ed update rule in

equations 4.50, i.e. where



E fw g =0

x

Efw g =0; y we get the eigenvector equations 4.44 and 4.45. As I have shown, the largest eigenvalue of these equations corresponds to the second largest eigenvalue of the original equations 4.36 and 4.37. This means that the modi®ed learning rule given in equations 4.50 will ®nd the second largest canonical correlation. This scheme can of course be repeated in order to ®nd the third canonical correlation by subtracting the second solution in the same way and so on.

4.4 Generating responses

The primary goal for a response generating system is to ®nd the maximum of an unknown reward function. A general problem for systems that sample the reward function is that 80 CANONICAL CORRELATION they will not see the reward function, but the reward function weighted by the sample distribution. The maximum of the product between the reward function and the sample distribution function does, of course, in general not coincide with the maximum of the reward function; these maxima coincide only if the expected response is the optimal re- sponse. It is, however, easy to show that the maximum of the product of two unimodal positive functions lies between the maxima of the two functions respectively (see proof A.2.14. Hence, if the system moves its solution (i.e. the mean of the sample distribution) towards the observed maximum, it thereby moves the solution closer to the maximum of the reward function. Admittedly, this does not prove that the maximum of the reward function is found and, hence, this problem needs to be investigated further.

The learning algorithm presented in section 4.3 is an unsupervised learning algorithm that ®nds certain statistical relations between two distributions. The remaining part of this chapter deals with the possibility of using the canonical correlation algorithm in re- inforcement learning. A straight forward approach towards this end is to ®nd the canon- ical correlation between rewarded input and output pairs simply by weighting the update

increments with the reward, i.e.



T T

a w = r x y w^ x w

x y x

(4.51 ) x



T T

b w = r y x w^ y w

y x y (4.51 ) y

To get an intuitive understanding of how this rule works, consider the case where the reward has only two values: 0 or 1. The decisions (i.e. input-output pairs) that receive a zero reward can be disregarded since they do not affect the weight vectors. Hence, the rewarded pairs of input and output vectors constitute two distributions between which the algorithm ®nds the canonical correlations. If the reward can take other values than 0 and 1, we can look at the method like this: if a decision with a certain reward is considered as one sample from the distribution, another decision with a reward twice as high may be considered as two samples at the same position.

As discussed in section 2.4.2, the response generation must be made in a stochastic way. This can be done by adding noise to the deterministic part of the output given by the linear model. This gives distributions of input and output vectors weighted with their rewards.

Response prediction

y =

As mentioned above, the algorithm described in section 4.3 is a least square estimator ~

T T

x w y = y w^ y~ y

x of . While is the maximum likelihood prediction of the projection of

w^ y onto y given x, it is in general not the most probable prediction of y. 4.4 GENERATING RESPONSES 81

Assume a Gaussian probability density function of y:

 

1 1

T 1

exp y C y : y =

(4.52) P

yy

n=2 1=2

2

2  kC k

yy

T T

y y w^ = x w x

In order to ®nd the most probable that causes a projection y ,wemaxi-

T T

P y  y w^ x w =0 x

mize under the constraint y . This gives:

C w^

yy y

T T

~

y = x w = x w v

x y

(4.53) x

T

w^ C w^

yy y y

(see proof A.2.15 on page 108). In ®gure 4.4, the problem is illustrated graphically. In

T

P y j y w^ w^  P y  y

the ®gure, a slice y of the probability density function orthogo-

w^ y~

nal to y is drawn to illustrate where to ®nd the most probable vector that causes the

y w^ projection on y .

P(y | y T w w ) P(y )

y vy 2

y wy

y 1

Figure 4.4: A graphical illustration of the problem of ®nding the most probable predic-

y w^ y w^ y

tion of given the vector y and the estimated projection of onto .

v w^ y We can see that y in equation 4.53 is the dual vector to de®ned in equation 4.43. In the previous section, we saw that the dual vector (that was used for ®nding the successive

solutions to the eigenproblem) could be found with an iterative method. Now we see that v this dual vector also can be used for ®nding the most likely response. Equivalently, x can be used for ®nding the most likely signal in the input space given a signal in the output space. 82 CANONICAL CORRELATION

4.5 Relation to a least mean square error method

In this section, we will see how the previously described method of response generation relates to a method of minimizing the mean squared error between the output and the

desired output, i.e. minimizing

X X

2

e =  = ky d k ;

i i

(4.54) i

i i

 d where i is the instantaneous squared error and is the desired output. In supervised

learning, this can be done by a gradient search, i.e.

X

@e @

i

w = = :

(4.55) 

@w @w i Now, since we are talking about reinforcement learning, the desired output is not known

so we use the actual output instead and the reward is used as a weight in the update rule:

X

@

i

w = r ;

(4.56) i

@w i

where

2

 = ky y~ k ;

i i

(4.57) i

y~

where i is the actual output generated by the system and, hence, responsible for the r reward i . Such an update rule will change the weights so that an output with a higher reward is more likely to be generated, which means that the expected output approaches the desired one. This is similar to the REINFORCE algorithms by Williams [90] and also to the idea of training on hypothetical target vectors described by GÈallmo and Asp- lund [18]. Since we want the algorithm to be iterative, the vectors are updated for each

input-output pair giving the instantaneous update rule

@

i

w = r i

(4.58) i

@w Now, consider a system that uses a linear model in the input space and the output space

respectively and generates the output as

T

y~ = w^ w x + = y + ;

(4.59) y x

where is a noise vector. The derivatives of the error in equation 4.57 with respect to

w w^ y

the vectors x and are

@

i

T T

w^  w y~ =2x x

y x

(4.60) i

i i

@w x 4.6 EXPERIMENTS 83

and

@

i

T

= 2y~ x w^ x

(4.61) i

i

@w^

y w

(see proof A.2.16). Note that the norm of y has no effect on the error and, hence, we calculate the derivative with respect to the normalized vector. Equation 4.58 now gives

the update rule as



T T

a w = r x y w^ x w

x y x

(4.62 ) x



T

b w^ = r y x w :

y x

(4.62 ) y a

We see that equation 4.62a is identical with equation 4.51 . The main difference from

T

r yy w b y the canonical correlation algorithm is the term y in equation 4.51 which is

absent in equation 4.62b. The purpose of this term in the canonical correlation algorithm

is to normalize the metric of the output space, i.e to make the algorithm insensitive to the

T

r xx w w

x x shape of the output distribution, just as the term x in the update rule for makes the algorithm insensitive to the shape of the input distribution.

It should be noted that, in this discussion, there is a principal difference between the input and output distributions. The input distribution is considered to be out of control for the system 2 while the output distribution is totally controlled by the system. Furthermore, there is no input that is better than the other but there are some responses that are better than others. For each input, there is a correct response3 which gives a maximum reward.

4.6 Experiments

In this section, some experimental results are presented to illustrate the performance of the algorithms described in this chapter. In the ®rst section, the behaviour of the algo- rithm described in section 4.3 is illustrated in the unsupervised learning task of ®nding the canonical correlation between two sets of multi-dimensional variables. In section 4.6.2, the performance of the canonical correlation algorithm described in section 4.4 is illus- trated in a reinforcement learning task. As a comparison, experiments performed on the same task by using the least mean square error method described in section 4.5 are pre- sented.

Some adaptive methods for selecting the parameters in the algorithms are also presented in this section. The reason for discussing these methods in the experiment section is that

2 Of course, this is not the case in general. An autonomous system, like a robot, certainly affects the input to the system. Here, however, as in the theory by Williams [90], this simpli®cation is made.

3 This is also a simpli®cation which is not valid in general. There is, however, no way to handle the opposite case if the system is unable to control the input. 84 CANONICAL CORRELATION they are rather ªad hocº and also not necessary for the development of the theory de- scribed in this chapter.

4.6.1 Unsupervised learning

The ®rst experiment illustrates the behaviour of the canonical correlation algorithm de- scribed in section 4.3 which is an algorithm for unsupervised learning. The task for the system is to ®nd the canonical correlations between two multi-dimensional random dis- tributions.

Adaptive update rate

Rather than tuning parameters to produce a nice result for a speci®c distribution, I have used adaptive update factors and parameters which produce similar behaviour for dif- ferent distributions and different numbers of dimensions. Also note that the adaptability

allows a system that has no a pre-speci®ed time dependent update rate decay.

y In the experiments, the coef®cients x and were calculated according to the following

equation:

8



1

x

<

= a e

x

x 

(4.63) y



y

1 :

= a e ;

y

y



x

a e e y

where is a small positive constant and x and are estimations of the size of the distri- y butions of x and respectively as seen from the models. These estimations were made

iteratively on line during the learning by the use of the following update rules:

8

T

<

e t +1= 1 be t+b kxx w k

x x x

(4.64)

:

T

e t +1= 1 be t+b kyy w k;

y y y

where b is a small positive constant. Equation 4.63 leads to

kw k kw k

x y

w / w / ; y

(4.65) x and

  i.e. the update of a vector is proportional to the length of the vector and inversely propor- tional to the correlation. The proportion to the length is necessary if a relative change of the vector should be independent of the length of the vector. This is a reasonable demand since the length of the vector can be of any order of magnitude. Especially, a change or- thogonal to the vector should cause a rotation of the vector that is independent of the 4.6 EXPERIMENTS 85 length of the vector. The inverted proportion to the correlation can be motivated in the following way: If the correlation is small, it can be assumed that the algorithm has not found the correct relation and the weight vector should be changed. A small correlation cannot be made much worse by changing the vector. If, however, the correlation is large, the system can be assumed to be near the solution and, hence, small changes should be made. It should be remembered that the correlation cannot be more than one. Admit- tedly, there is a risk with this approach: if the largest correlation is very small but still important to ®nd, the ®nal solution will possibly ¯uctuate too much. In practice, how- ever, it is hard to see why such a bad solution should be so important to represent in a stable way; the performance would be poor, no matter how stable the solution is, if the correlation between the stimuli and the rewarded responses is small.

Adaptive smoothing w

To get a smooth and yet fast behaviour, an adaptively time-averaged set of vectors a was calculated. The update speed was made dependent on the consistency in the change

of the original vectors w according to equation 4.66:

(4.66)

8

1

<

w t +1= w t+c k kkw k w tw t+ 1

ax ax x x x ax

:

1

w t +1= 1cw t+c k kkw k w tw t+ 1 ;

ay ay y y y ay

where

8

<

 t +1= 1d t+d w t

x x x

(4.67)

:

 t+1= 1 d t+d w t:

y y y

Results

The experiments have been carried out by using a randomly chosen distribution of a 10- dimensional x variable and a 5-dimensional y variable. Two x and two y dimensions were partly correlated. The other 8 dimensions of x and 3 dimensions of y were uncor- related. The variances in the 15 dimensions were of the same order of magnitude. The

two canonical correlations for this distribution were 0.98 and 0.80. The parameters used

=0:1; b=0:05; c =0:01; d =4 =0:01 in the experiments were a and . 10 runs of 2000 iterations each have been performed. For each run, error measures were calculated.

The errors shown in ®gure 4.5 are the averages over the 10 runs. The errors of the direc-

w ; w ; w w

x1 ax2 ay 1 ay 2 tions of the vectors a and were calculated as the angle between the 86 CANONICAL CORRELATION

Err [Wx1] Err [Wx2] 1 1

0.5 0.5

0 0 0 500 1000 1500 2000 0 500 1000 1500 2000

Err [Wy1] Err [Wy2] 1 1

0.5 0.5

0 0 0 500 1000 1500 2000 0 500 1000 1500 2000

Err [Corr1] Err [Corr2] 1 1

0.5 0.5

0 0 0 500 1000 1500 2000 0 500 1000 1500 2000

Figure 4.5: Error magnitudes for 2000 time steps averaged over 10 runs of the canonical

correlation algorithm in the unsupervised learning task with a 10-dimensional x vector

and a 5-dimensional y vector. The solid lines display the differences between the algo- rithm and the exact values. The dotted lines show the differences between the optimal

solutions obtained by solving the eigenvector equations and the exact values (see text for

w^ x

further explanation). The top row shows the error angles in radians for a . The mid-

w^ y dle row shows the same errors for a . The bottom row shows the relative errors of the

estimation of . The left column shows results for the ®rst canonical correlation and the

right column shows the results for the second canonical correlation.

e xy

vectors and the exact solutions ^ (known from the sample distribution), i.e.

T

Err[w^ ] = arccosw^ ^e: a These measures are drawn with solid lines in the four top diagrams. As a comparison,

the error of the optimal solution was calculated for each run as

T

^e; Err[w^ ] = arccosw^

opt

opt w where opt were calculated by solving the eigenvalue equations for the actual sample covariance matrices. These errors are drawn with dotted lines in the same diagrams. Fi- 4.6 EXPERIMENTS 87

nally, the errors in the estimations of the canonical correlations were calculated as



n

Err[Corr ] = 1 ;



en  where en are the exact solutions. The results are plotted with solid lines in the bottom diagrams. Again, the corresponding errors of the optimal solutions were calculated and are drawn with dotted lines in the same diagrams.

4.6.2 Reinforcement learning

In this section, the performance in a simple reinforcement learning task is illustrated. The

optimal response is simply

1

y~ =x +x  ; 2

(4.68) 1

1

x x 2

where 1 and are random numbers from rectangular distributions such that

x 2 [0:25; 0:25] x 2 [0:5; 0:5] 2

1 and . The reinforcement is de®ned as

2

20ky y~k

= e : (4.69) r

The weight vectors are initiated with small random numbers. First, the performance when using the canonical correlation algorithm described is illustrated and then, as a compari- son, the results of the least mean square error method described in section 4.5 when used in the same task are presented.

Adaptive update rate

Also in this experiment an adaptive update rate is used In this case, however, the coef®-

y

cients x and are calculated as

8

1

<

= akw ke

x x

(4.70) x

:

1

= akw ke ;

y y

y

e e y where x and are calculated according to equation 4.64. The difference between this rule and that in section 4.6.1 is that this rule is not dependent on the correlation. 88 CANONICAL CORRELATION

Results when using canonical correlation

In ®gure 4.6, the performance of the algorithm described in section 4.4 is illustrated. The

algorithm was run 10 times with 5000 trials each time. The results were then averaged

= 0:05 b = 0:02 over these ten runs. The parameters used were a and . The errors of the directions of the weight vectors were calculated in the same way as in the previous

experiment, i.e. as the angle between the vectors and the corresponding optimal vector

T 1 1

e, in this case :

T

Err[w^ ] = arccosw^ ^e: a These errors are plotted to the left in ®gure 4.6. To the right, the error of the output is plotted. This error is calculated as the Euclidean distance between the output and the

optimal output, i.e.

Err[y]=kyy~k; y where ~ is de®ned in equation 4.68.

Err[Wx] Err[y] 0.3 1

0.2 0.8 0.1

0 0.6 0 1000 2000 3000 4000 5000 Err[Wy] 0.3 0.4 0.2 0.2 0.1

0 0 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000

Figure 4.6: Error magnitudes for 5000 time steps averaged over 10 runs of the canonical

correlation algorithm in a reinforcement learning task. Left: The error angles in radians

w w y of the input weight vector x (top) and the output weight vector (bottom). Right:

The error of the output y measured as the Euclidean distance between the output and the optimal output.

Results when using least mean square error

In ®gure 4.7, the results of the algorithm described in section 4.5 when used in the same problem as in the previous experiment are illustrated. The parameters used are the same as in the previous case and the errors are calculated in the same way. We see that, in this experiment the results are slightly better for the least mean square error method than for the modi®ed canonical correlation algorithm. 4.6 EXPERIMENTS 89

Err[Wx] Err[y] 0.3 1

0.2 0.8 0.1

0 0.6 0 1000 2000 3000 4000 5000 Err[Wx] 0.3 0.4 0.2 0.2 0.1

0 0 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000

Figure 4.7: Error magnitudes for 5000 time steps averaged over 10 runs of the least mean

w w y square error algorithm. Left: The error angles of x (top) and of (bottom). Right:

The error of the output y (see ®gure 4.6).

4.6.3 Combining canonical correlation and the prediction matrix memory

As mentioned at the beginning of this chapter, the canonical correlation algorithm was developed in order to ®nd features in the input and output spaces that are important for the response generation. The purpose is then to use the channel representation described in chapter 3 to represent these features. The prediction matrix memory is used to generate predictions and responses. In this section, some preliminary results that demonstrate a possible way of combining the algorithms described in this thesis and using them as local adaptive models in a larger system are presented.

The system

The system in this experiment consists of two local adaptive models, and each model consists of two parts: one part that ®nds the largest canonical correlation according to the algorithm described in section 4.4 and one part that learns an approximation of the reward function according to the algorithm described in section 3.4, i.e. the prediction

matrix memory. The input to the prediction matrix memory is a pure channel vector (see

w^ section 3.1) that represents the projection of the input signal onto x . The output of the

prediction matrix memory is a pure channel vector that represents the output that gives

w^ the highest predicted reward along y . At the beginning, the system tries to solve the task by using only one model. Later, however, when the system has found a good model which is indicated by a large correlation, it uses the second model when the predicted re- 90 CANONICAL CORRELATION

ward is low. The algorithm can be summarized as an iteration of the following sequence:

T

x = x w^ i 2f1;2g ix

1. Project the input vector onto the input weight vectors: i , .

v x  i

2. Let the projections be represented by pure channel vectors i .

q = W v 1

3. Calculate the output from subsystem 1: 1 .

T

p = q W v

1 1

4. Calculate the reward prediction: 1 .

p

 = w w

x 1y

5. Calculate the correlation for system 1: 1 .

> p

1 lim 2 2

6. If lim and , repeat step 3 with and .

q

7. Calculate the output magnitude y .

y = y w^ + i q

8. Generate the output: iy ,where is the system that generated ,and

is a noise vector.

9. Get the reward r

y 

10. Calculate the channel representation q of the actual output.

W w w i

ix iy 11. Update i , and ,where is the used subsystem, according to equation 3.21 and equation 4.51.

The task

The task in this experiment is designed to simulate the problem when a system senses the same data through different sensors that have different ranges and different precision. This is, for example, the case in the vergence problem which is the problem of making the two eyes of a binocular system look at the the same point of the image. This problem can be dealt with in several ways; one way is to use a hierarchy of ®lters in different scales. The images can then be aligned by starting on a coarse level that has large ®lters and then by using smaller ®lters that have higher precision for improving the disparity4 estimate. This method was described by Wilson and Knutsson [91] and by Westelius in his PhD thesis [84]. The problem for the system in a task of this type is to decide which sensors to use for different input.

In this experiment, the task is to generate the negative of the difference d (ªdisparityº)

1 < d < 1 x x l

( ) between two values r and which are sensed through four sensors

x ;:::;x x x

4 1 2 which gives the output 1 .and are coarse large-range sensors that senses

4 The disparity is the spatial displacements between the two partly overlapping images.

4.6 EXPERIMENTS 91

x x x

l 3

r and respectively for all disparities but with noise added to the measurements.

x jdj < 0:5

and 4 are ®ner sensors that cover a smaller range of disparities. If , the mea-

x x x x

4 r l

surements 3 and are identical with and respectively, while for larger disparities, d

these measurements are noise uncorrelated with the disparity. The correct output is

y w

1x

along the 1 -axis. Hence, the system should converge to a solution where is in the

x ; x 2

1 direction in the four-dimensional input space in order to detect the difference be-

x x x x w x

2 3 4 2x 3

tween 1 and without disturbance from the noise in and , while also takes

x x 1

and 4 into account since these measurements, in most cases, are more reliable than x and 2 in the cases when system 2 is active. System 1 should be used for large disparities and system 2 for small disparities.

The results

The system was trained on 10000 trials in the task described above by using a reinforce-

ment signal

jdy j

2 

jdj

r = e which rewards the output relatively to the disparity. To the left in ®gure 4.8, the system's choices of subsystem for different disparities are illustrated in two histograms. They show that for large disparities, system 1 was almost always chosen, while system 2 was prefered for small disparities. To the right in the ®gure, the system's output for the dis-

parities in the 2000 ®nal trials is plotted. The ideal function is, of course, a straight line

w^

from (-1,-1) to (1,1). The weight vectors ix in this experiment converged to

0 1 0 1

0:7 0:4

B C B C

0:7 0:3

B C B C

w^ = w^ = :

x 2x

1 and

@ A @ A

0:05 0:6

0:04 0:6

w^ x

For 1 , this is almost the optimal direction; it should only look at the differencebetween

x x x x

2 3 4

1 and since and are uncorrelated with the correct output for large disparities.

w^ x

For 2 , it is more dif®cult to decide the optimal direction. For disparities less than 0.5,

T

x x x x 0 0 0:7 0:7

4 1 2 3 and aremorereliablethan and and hence a direction of

is optimal, but since this system is sometimes chosen for disparities larger than 0.5 (see

x x 2 the histogram), 1 and must also be taken into account.

It should be noted that the prediction matrix memories and the canonical correlation wei- ght vectors are updated simultaneously in this algorithm. This is a kind of bootstrap- ping process; the shape of the prediction matrix memory depends on the direction of the canonical correlation vectors and to be able to ®nd the correlation between rewarded input-output pairs, the prediction matrix memory must generate correct responses. 92 CANONICAL CORRELATION

System 1 System output 1 1

0.5 0.5

0 −1 −0.5 0 0.5 1 0 System 2 1 −0.5 0.5

−1 0 −1 −0.5 0 0.5 1 −1 −0.5 0 0.5 1

Figure 4.8: Left: A histogram of the relative choices of subsystem for different dispar- ities. For large disparities, system 1 was prefered while for small disparities, system 2 was prefered Right: The disparity and the corresponding output from the whole system plotted for the ®nal 2000 samples of a learning sequence of 10000 samples.

4.7 Conclusions

In this chapter, the theory of canonical correlation and its relation to the generalized eigen- problem has been described and an iterative method for ®nding the canonical correlation between two sets of multidimensional variables has been presented. How this algorithm can be used in reinforcement learning has also been discussed and the relation to a least mean square error method has been mentioned. Finally, some experimental results have been presented.

The results show that in the case of unsupervised learning, the presented algorithm, which

N  has the complexity O , has a performance comparable to that which can be obtained

by estimating the sample covariance matrices and by calculating the eigenvectors and

3

N  eigenvalues explicitly (complexity O ). The experiments show that in the case of reinforcement learning, the algorithm has a performance, in terms of the output error, similar to that of an algorithm designed to minimize the squared error. The experiments also indicate a possible way of combining the prediction matrix memory and the canon- ical correlation algorithm and of using these aggregates as local adaptive models in re- inforcement learning.

The original canonical correlation algorithm is an unsupervised learning algorithm for ®nding the canonical correlations between two vector distributions. The modi®ed algo- rithm for reinforcementlearning presented in this chapter is a ®rst attempt of using canon- ical correlation for response generation. It illustrates, as the prediction matrix memory 4.7 CONCLUSIONS 93 described in the previous chapter, that, in these cases, unsupervised learning algorithms can be modi®ed into reinforcement learning algorithms simply by weighting the learning rate with the reinforcement signal. 94 CANONICAL CORRELATION

5

SUMMARY

In this thesis, the reinforcement learning methodology has been described and some ba- sic issues in reinforcement learning have been discussed. Reinforcement learning seems to be the only generally useful learning method for autonomous systems acting in com- plex environments. The idea of using local adaptive models have been presented and exempli®ed in different experiments.

The prediction matrix memory has been presented. It has been shown that this structure is able to learn an approximation of the reward function by using the ef®cient channel rep- resentation and that it can generate proper responses after learning this reward function. This system is meant to be used as a local adaptive model and therefore a dynamic tree structure for organizing the prediction matrix memories was presented. The predicted rewards can be used for selecting a model among a set of local models for a given input.

In chapter 4, an unsupervised learning algorithm for ®nding the canonical correlations between two signals was presented. It was argued that in a response generating system, it is important to use models that maximize the correlation between rewarded pairs of input and output signals since the energy of a signal has little to do with its signi®cance for the response generation.

A simple way of modifying the canonical correlation algorithm in order to be able to use it directly in reinforcement learning has been investigated and compared to a similar modi®cation of a supervised least mean square error method modi®ed to ®nd the maxi- mal reward in a reinforcement learning task. Finally, a way of combining the canonical correlation algorithm and the prediction matrix memory into a local adaptive model has been indicated.

There are a number of issues left for further investigation. First of all, in chapter 3 it was stated that the prediction matrix memory is well suited for TD-methods. Therefore,

95 96 SUMMARY more work should be done on dynamic problems. Dynamic problems are also interesting within the theory of canonical correlation; a dynamic system will affect the input signal distribution and it is not obvious what effect this will have in the case of reinforcement learning.

The modi®cation of the canonical correlation algorithm into a reinforcement learning al- gorithm was rather ªad hocº and how a reward signal should be incorporated into the algorithm remains to be theoretically investigated. Also, the comparison with the mod- i®ed least mean square algorithm was not very illustrative since the reward was de®ned as a function of the squared error. In general, the reward function is not that simple; there may, for instance, be some directions where the reward is constant and some input sig- nals that are ªbetterº than others. After all, the goal for the system is not to minimize some error measure but to maximize the reward (i.e. the instant reward, the accumulated reward, the minimal expected reward or some other function of the reward) de®ned by an unknown reward function.

The combination of the prediction matrix memory and the canonical correlation algo- rithm combines the possibility of predicting rewards and the ability of ®nding directions in the input and output spaces that are important for response generation. Both these features are probably important for a learning system acting in a dynamic environment which has high dimensionality. Methods for combining these features should be inves- tigated further.

Little effort has been made to increase the convergence rates. Both the prediction matrix memory and the canonical correlation algorithm use gradient search, and it should be in- vestigated how the use of faster methods, e.g. the conjugate gradient method by Fletcher and Reeves [17], affects their performance.

A

PROOFS

In this appendix, all proofs refered to in the text are presented. Section A.1 contains the proofs for chapter 3 and section A.2 contains the proofs for chapter 4.

A.1 Proofs for chapter 3

A.1.1 The constant norm of the channel set

Each channel has the form



2

c = cos  x k 

k 3

97

98 PROOFS

 3 

for j . Consider the interval . On this interval, all channels are

2 6 6

= 1; 0; 1

zero except for k . Hence, it is suf®cient to sum over these three channels.

h h h i i i

X

  

2 4 4 4

jc j = cos x 1 + cos x + cos x +1

k

3 3 3

h i h i h i

2 2 2

1 2 2 2

= 1 + cos x + 1 + cos x 1 + 1 + cos x +1

4 3 3 3

h h h i i i

2 2 2 1

2 2 2

= 3 + cos x + cos x 1 + cos x +1

4 3 3 3

i h i h i h

2 2 2

x + 2 cos x 1 + 2 cos x +1 + 2 cos

3 3 3

h i h i h i h i h i

2

2 1 2 2 2 2

2

= 3 + cos x + cos x cos + sin x sin

4 3 3 3 3 3

h i h i h i h i h i

2

2 2 2 2 2

+ cos x cos sin x sin + 2 cos x

3 3 3 3 3

h i h i h i h i

2 2 2 2

+2 cos x cos + sin x sin

3 3 3 3

h i h i h i h i

2 2 2 2

+2 cos x cos sin x sin

3 3 3 3



p

2

h i h i h i

1 2 1 2 3 2

2

= 3 + cos x + cos x + sin x

4 3 2 3 2 3

p

2

h i h i h i

2 3 2 2 1

cos x sin x + 2 cos x +

2 3 2 3 3

p p

h i h i h i h i

1 2 3 2 1 2 3 2

+2 cos x + sin x +2 cos x sin x

2 3 2 3 2 3 2 3

p

h i h i h i h i h i

1 2 1 2 3 2 3 2 2

2 2 2

= 3 + cos x + cos x + sin x cos x sin x

4 3 4 3 4 3 2 3 3

p

h i h i h i h i h i

1 2 3 2 3 2 2 2

2 2

+ cos x + sin x + cos x sin x + 2 cos x

4 3 4 3 2 3 3 3

h i h i h i h i

p p

2 2 2 2

cos x + 3 sin x cos x 3 sin x

3 3 3 3

h i h i h i

1 2 1 2 3 2

2 2 2

= 3 + cos x + cos x + sin x

4 3 2 3 2 3

9

= : 8

This case can be generalized for any x that is covered by three channels of this shape that

are separated by  . 3 A.1 PROOFS FOR CHAPTER 3 99

A.1.2 The constant norm of the channel derivatives x

The derivative of a channel k with respect to is

h i h i

d 2  

c = cos x k  sin x k  :

k

dx 3 3 3

The sum is then

h h i i

2

X

2

  d 2

2 2

c = cos x sin x

k

dx 3 3 3

h h i i

 

2 2

x 1 sin x 1 + cos

3 3

h h i i

 

2 2

+ cos x +1 sin x +1

3 3

h i h i h i

2

 2 2 2

2 2 2

= sin x + sin x 1 + sin x +1

3 3 3 3

h i h i h i h i h i

2

2

 2 2 2 2 2

2

= sin x + sin x cos cos x sin

3 3 3 3 3 3

h i h i h i h i h i

2

2 2 2 2 2

2

+ sin x + sin x cos + cos x sin

3 3 3 3 3

h i h i h i h i h i

2

 2 2 2 2 2

2 2 2 2 2

= sin x +2 sin x cos + cos x sin

3 3 3 3 3 3

h i h i h i

2

 2 3 2 1 2

2 2 2

= sin x +2 cos x + sin x

3 3 4 3 4 3

h i h i

2

 3 2 2

2 2

= cos x + sin x

3 4 3 3

2



= : 12

A.1.3 Derivation of the update rule (eq. 3.17)

By inserting equation 3.16 into equation 3.13, we get

T T

r = hW + aqv j qv i

T T T

= hW j qv i + ahqv j qv i

T T

= p + aq qv v 

2 2

= p + akqk kv k : 100 PROOFS

A.1.4 The change of the father node (eq. 3.36)

m W i

After the update of i and ,weget

m +m W +W +m +m W +W =0

1 1 1 1 2 2 2 2

and this equation together with equations 3.33, 3.34 and 3.35 give

m +m W +W W +m +m W W =0:

1 1 1 f 2 2 2 f

This leads to

W m +m + m +m 

f 1 1 2 2

=W +Wm +m +W m +m 

1 1 1 2 2 2

= W m +W m +W m +m +W m +W m

1 1 2 2 1 1 1 1 2 2

{z } |

=0

0

= m +m

which, with the substitution m , gives the result:

0

W m + W m + W m

1 1 2 2

1

: W =

f

0 0

m + m

1 2

A.2 Proofs for chapter 4

A.2.1 The derivative of the Rayleigh quotient (eq. 4.3)

T

@r 2Aw 2w AwBw

=

T T 2

@w w Bw w Bw 

2

= Aw r Bw 

T

w Bw

2kw k

= A^w rB^w

T

w Bw

^ ^

= Aw rBw

A.2.2 Orthogonality in the metrics A and B (eqs. 4.5 and 4.6)

For solution i we have

A^w = r B^w :

i i i A.2 PROOFS FOR CHAPTER 4 101

The scalar product with another eigenvector gives

T T

w^ A^w = r w^ B^w

i i i

j j

and, of course, also

T T

w^ A^w = r w^ B^w :

j j j

i i

w^ w^ j

Since A and B are symmetric, i and can change positions which gives

T T

r w^ B^w = r w^ B^w

j j i j

i i

and hence

T

r r w^ B^w =0:

i j j

i

T

i 6= j w^ B^w =0 r 6= r

i j

For this expression to be true when , we have that j if .For

i

T

i = j w^ B^w > 0

, we know that j since B is positive de®nite. In the same way, we have

i

1 1

T

w^ A^w =0

j

i

r r

i j

T T

w^ A^w = 0 i 6= j i = j w^ A^w = i

which means that j for .For , we know that

i i

T

r w^ B^w i

i . i

A.2.3 Linear independence w

Suppose that i are not linearly independent. This means that we can write an eigenvec- w

tor k as

X

w^ = w^ ;

k j j

j 6=k

6=0 6=0

where and .Thismeansthat

T

w^ B^w = w^ B^w 6=0

k i i i

i w which violates equation 4.5. Hence, i are linearly independent.

A.2.4 The range of r (eq. 4.7)

w^

If we express a vector w in the basis of the eigenvectors i ,i.e.

X

w = w^ ;

i i i

102 PROOFS

P P

we can write P

T 2

w^ A w^

i i i i

i i

P

P P

r = = :

2

T

w^ B w^

i

i i i

i

i

= r

i i

Now, since i (see equation 4.6), we get

P

2

r

i i

i

P

r = :

2

i

i

r 6=0 =08i>1 r

1 i 1

This function has its maximum value 1 when and if is the

r 6= 0 = 0 8 i

n i n largest eigenvalue and its minimum n when and if is the smallest eigenvalue.

A.2.5 The second derivative of r (eq. 4.8)

From the gradient in equation 4.3 we get the second derivative as

 

2

@ r 2 @r

T T T

= A w B rB w^ B^w Aw r Bw 2w B :

2 T 2

@ w w^ B^w @w

w^

If we insert one of the solutions i , we have

@r

= Aw r Bw = 0

@w

and hence

2

2 @ r

= A r B :

i

2 T

@ w w^ B^w

w =w^ i

A.2.6 Positive eigenvalues of the Hessian (eq. 4.9)

If we express a vector w as a linear combination of the eigenvectors, we get

i

T T

w H w = w A r Bw

i i

2

T 1

= w BB A r Iw

i

X X

T 1

= w^ BB A r I w^

j i j j

j

X X X

T

r w^ r w^ B w^ =

j j j i j j j

j

X X

T

B r r  w^ = w^

j i j j j

j

X

2

= r r ;

j j i j

A.2 PROOFS FOR CHAPTER 4 103

T

= w^ B^w > 0 r r  > 0 j1

i j i

where i .Now, for so if , there is at least one i choice of w that makes this sum positive.

A.2.7 The successive eigenvalues (eq. 4.10)

Consider a vector u which we express as the sum of one vector parallel to the eigenvector

u^ u o

1 and another vector that is a linear combination of the other eigenvectors and, hence, v

orthogonal to the dual vector 1 :

u = au^ + u ;

1 o

where

T T

v u^ =1 v u^ =0: o

1 and

1 1

Multiplying N with u gives



T

Nu = M  u v  au^ + u 

1 1 1 o

1

= a M^u  u^ +M^u 0

1 1 1 o

= Mu : o

This shows that M and N have the same eigenvectors and eigenvalues except for the

largest eigenvalue and the corresponding eigenvector. The eigenvector corresponding

u^ to the largest eigenvalue of N is 2 .

A.2.8 The derivative of the covariance (eq. 4.15)

q q

1

T T

T T T T

w w w w w C w w w w w w w w C w

x y xy y x y x y xy y

x y x y x y

@

=

T T

@w w w w w

x x y

x y

T T

w C w w w w

C w

xy y x y

xy y x y

q q

=

T T T T T T

w w w w w w w w w w w w

x y x y x y

x y x y x y

1

= C w^ w^  :

xy y x

kw k

x

@ y

The same calculations can be done for by exchanging x and .

@w^ y 104 PROOFS

A.2.9 The derivatives of the correlation (eqs. 4.22 and

4.23)

 w^

The partial derivative of with respect to x is

T T 1=2

w^ C w^ w^ C w^  C w^

@

xx x yy y xy y

x y

=

T T

@w^ w^ C w^ w^ C w^

x xx x yy y

x y

T T T 1=2 T

w^ C w^ w^ C w^ w^ C w^  C w^ w^ C w^

xy y xx x yy y xx x yy y

x x y y

T T

w^ C w^ w^ C w^

xx x yy y

x y

T

w^ C w^

xy y

x

1=2 T T

C w^  =w^ C w^ w^ C w^ C w^

xy y xx x yy y xx x

x y

T

w^ C w^

{z } |

xx x

x

0

T

w^ C w^

xy y

x

C w^ ; a  0: = a C w^

xx x xy y

T

w^ C w^

xx x

x

@ y

The same calculations can be done for by exchanging x and .

@w^

y 

A.2.10 The equivalence of r and

C C w B Let A be de®ned by equation 4.18, be de®ned by equation 4.25 and be de®ned

by equation 4.26. When the gradient of  is zero, a multiplication of equation 4.24 from

the left by w gives

2 T 2 T

 w C w =  w C w :

xx x yy y

x x y y

A = C B = C B

With A and equation 4.2 implies that

T T

  w^ C w^ +   w^ C w^

x y xy y x y yx x

x y

r =

2 T 2 T

 w^ C w^ +  w^ C w^

xx x yy y

x x y y

T

2  w^ C w^

x y xy y

x

=

2 T

2 w^ C w^

xx x

x x

T

C w^ w^

xy y

x

q

= = :

T T

w^ C w^ w^ C w^

xx x yy y

x y

r  This shows that where the gradient of r is zero, and are equal. A.2 PROOFS FOR CHAPTER 4 105

A.2.11 Invariance with respect to linear transformations (page 75)

Let

0 0

x x T 0 x

x

; = T =

0 0

y y 0 T y

y

T T y

where x and are non-singular matrices. If we denote

T

0 0 0

g; = E fx x C xx

the covariance matrix of x can be written as

T

T T 0 T 0 0

T : T g = T C C = E fxx g = E fT x x

xx x xx x

x x

In the same way we have

0 T 0 T

C = T C T C = T C T :

x xy yy y yy

xy and

y y

This means that

0 T 0 T

C = TC T C = TC T ;

A B B A and

where

0 0

0 C C 0

xy xx

0 0

C = C = : B

A and

0 0

C 0 0 C

yx yy

Now equation 4.24 can be written as

0 T 0 T

TC T w = TC T

A B

or

0 0 0 0

C w = C w ;

A B

0 T

= T w  where w . This transformation leaves the roots unchanged. If we look at the

canonical variates

T

0 0 0 T 1

x = w x = w T T x = x

x

x x x

and

T

1 0 T 0 0

y = y; T T y = w y = w

y

y y y we see that these are unaffected by the linear transformation too.

106 PROOFS

C C yy A.2.12 Orthogonality in the metrics xx and (eq. 4.40)

We can write equation 4.36 as

1 2

C C C w^ =  C w^ ;

xy yx xi xx xi

yy i

w^ w^ xj

where xi is the i:th eigenvector. The scalar product with another eigenvector gives

T 1 2 T

w^ C C C w^ =  w^ C w^ ;

xy yx xi xx xi

xj yy i xj

and, of course, also

T 1 2 T

w^ C C C w^ =  w^ C w^ :

xy yx xj xx xj

xi yy j xi

1

C C C C  w^ w^

xy yx xi xj

Since xx and are symmetric, and can change positions which yy

gives

2 T 2 T

 w^ C w^ =  w^ C w^ :

xx xj xx xj

i xi j xi

Hence,



2 2 T

  w^ C w^ =0

xx xj

i j xi

which means that

T

w^ i 6= j: C w^ =0 xj

xx for xi

The same calculations can be done for equation 4.37.

A.2.13 The equilibrium of the modi®ed learning rule (eq. 4.50)

In this proof, I use the facts that

T T

kw kw^ C w^ = kw kw^ C w^

x xx x y yy y

x y

and that 

C w^ = kw kC w^

xy y x xx x

:

C w^ = kw kC w^

yx x y yy y

Equation 4.50a gives

T T

E fw g = E fxy y  w^ xx w g

x2 x y x

1



T

g w^ kw kC w^ : = C E fxy

y x xx x x xy 1 A.2 PROOFS FOR CHAPTER 4 107

In the same way, equation 4.50b gives



T

E fw g = C E fyx g w^ kw kC w^ :

y 2 y yx x2 y2 yy y2 1

Combining these two equations at the equilibrium gives

 

1 T 1 T 2

C C E fxy g C C E fyx g w^ =  w^

xy yx x x2

xx 1 yy 1

or

2

M A w^ =  w^ ;

x x (A.1) x

where

1 1 T 1 T 1 1 T 1 T

A = C C C E fyx g + C E fxy gC C C E fxy gC E fyx g

xy yx

xx yy 1 xx 1 yy xx 1 yy 1

| {z } {z } | {z } |

A A A

1 2 3

T

w^ w^ C

x1 xx

x1

2 2 T

=  =  w^ v

x1

1 1 1

T

w^ C w^

xx x1

x1

since

1 T

^

C C w^ w C

xy y 1 xx

xx y 1

A =

1

T

w^ C w^

yy y1

y 1

1 T T

C kw k C w^ kw k w^ C w^ w^ C

x1 xx x1 x1 xx x1 xx

xx x1 x1

2

= =  ;

1

T

kw k

x1

T

w^ C w^

xx x1

w^ C w^

x1

xx x1

x1

kw k

y 1

T 1

w^ w^ C C C

x1 xy yx

x1 yy

A =

2

T

w^ C w^

xx x1

x1

T 1

w^ kw k w^ C C C

x1 y 1 yy yx

y 1 yy

=

T

w^ C w^

xx x1

x1

T T

w^ kw kkw kw^ C w^ w^ C

x1 y 1 x1 xx x1 xx

x1 x1

2

= = 

1

T T

w^ C w^ w^ C w^

xx x1 xx x1

x1 x1

and

1 T T

^ ^

C C w^ w C w^ w C

xy y 1 yx x1 xx

xx y 1 x1

A =

3

T T

^

w^ C w w^ C w^

xx x1 yy y1

x1 y 1

1 T T

^

C kw k w w^ C kw k w^ w^ C C

xx x1 x1 yy y1 y1 xx

xx y 1 x1

=

T T

^

w C w^ w^ C w^

xx x1 yy y1

x1 y 1

T T

T

^

kw kkw kw^ w^ C w^ w C

w^ w^ C

x1 y1 x1 yy y1 xx

x1 xx

y1 x1

x1

2

=  : =

1

T T T

^

w^ w^ C w w^ C w^ C w^

xx x1 xx x1 yy y1

x1 x1 y 1 108 PROOFS

Hence, equation A.1 is identical with equation 4.44. The same calculations can be done

for

2

M B w^ =  w^

y y y which (for symmetry reasons) equals equation 4.45.

A.2.14 The maximum of the product of two unimodal

functions

f = g jx ajhjx bj;

s hs s  0 a < b

where g and are positive and monotonically decreasing for ,and . b

An example of such a function is the product of two Gaussians with the mean a and a

respectively. Consider the cases when x is outside the interval :

@f

0 0

xa0

@x

@f

0 0

xb>a: fx= g x ahb x  = g h + h g < 0:

@x

x a

A.2.15 The most probable response (eq. 4.53)

T T

P y  y w^ x w = 0 x To maximize under the constraint y , we maximize the La-

grangian



1

T 1 T

L = y C y +  y^w x w

y x

yy 2

with respect to y.

@L

1

= C y + w^ = 0

y

yy

@y

gives

y = C w^ :

yy y

T T

y w^ x w =0 x

The condition y means that

T T

w^ C w^ x w =0;

yy y x y A.2 PROOFS FOR CHAPTER 4 109

which gives

T

x w

x

 = :

T

w^ C w^

yy y y

Hence,

C w^

yy y

T

y = x w :

x

T

w^ C w^

yy y y

A.2.16 The derivatives of the squared error (eqs. 4.60 and 4.61)

The squared error is

2

 =yy~

T T T T

=x w w^ y~ x w w^ y~

x y y

x

T 2 T T

=x w  2x w y~ w^ :

x x y

Hence, the derivatives are

@

T T T T

=2x w x2x~y w^ =2xx w y~ w^ 

x y x y

@w x

and

@

T

= 2yx~ w^ :

x

@w^ y 110 PROOFS

B

MINSKY'S AND PAPERT'S

DEFINITION OF ORDER

To understand Minsky's and Papert's de®nition of order, we must ®rst de®ne the concepts

of predicate, support and linear threshold function.Apredicate ' is a function that has

S '

two possible values, for example true/false or 1/0. The support of ', , is the subset '

of the input space on which ' is dependent. This means that, for example, if the input to

x ;x ; :::; x ' 'x ;x ;x  '; S '

2 9 2 4 7

consists of 1 but is only a function , the support of ,

[x ;x ;x ] jS 'j = 3 

4 7 is 2 and the size of the support . A function is called a linear

threshold function with respect to  if it can be calculated as

8

X

<

1 ''x >

if

 x=

'2

:

0 otherwise

' ' where  is a family of predicates , is a set of coef®cients (weights) and is a

threshold. This this de®nition is the same as Minsky's and Papert's de®nition of a per- k ceptron. The order of  is then de®ned as the smallest number for which there is a set

 of predicates so that

jS 'jk 8'2  and that  is a linear threshold function with respect to .

111 112 MINSKY'S AND PAPERT'S DEFINITION OF ORDER

Bibliography

[1] J. A. Anderson. A simple neural network generating an interactive memory. Math- ematical Biosciences, (14):197±220, 1972.

[2] J. A. Anderson. Cognitive and psychological computation with neural models. IEEE Transactions on Systems, Man, and Cybernetics, (14):799±815, 1983.

[3] D. H. Ballard. Vision, Brain, and Cooperative Computation, chapter Cortical Con- nections and Parallel Processing: Structure and Function. MIT Press, 1987. M. A. Arbib and A. R. Hanson, Eds.

[4]D.H.Ballard.Computational Neuroscience, chapter Modular Learning in Hierar- chical Neural Networks. MIT Press, 1990. E. L. Schwartz, Ed.

[5] A. G. Barto. Handbook of Intelligent Control, chapter Reinforcement Learning and Adaptive Critic Methods. Van Nostrand Reinhold, New York, 1992. D. A. White and D. A. Sofge, Eds.

[6] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve dif®cult learning control problems. IEEE Trans. on Systems, Man, and Cybernetics, SMC-13(8):834±846, 1983.

[7]R.D.Bock. Multivariate Statistical Methods in Behavioral Research.McGraw- Hill series in . McGraw-Hill, 1975.

[8] M. Borga. Hierarchical reinforcementlearning. In S. Gielen and B. Kappen, editors, ICANN'93, Amsterdam, September 1993. Springer-Verlag.

[9] M. Borga and H. Knutsson. A binary competition tree for reinforcement learn- ing. Report LiTH-ISY-R-1623, Computer Vision Laboratory, S±581 83 LinkÈoping, Sweden, 1994.

[10] G. H. Bower and E. R. Hilgard. Theories of Learning. Prentice±Hall, Englewood Cliffs, N.J. 07632, 5 edition, 1981.

[11] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classi®cation and Re- gression Trees. The Wadsworth Statistics Probability Series. Wadsworth & Brooks, Monterey, California, 1984.

113 114 BIBLIOGRAPHY

[12] V. B. Brooks. The Neural Basis of Motor Control. Oxford University Press, 1986.

[13] L. Davis, editor. Genetic Algorithms and Simulated Anealing. Pitman, London, 1987.

[14] T. Denoeux and R. LengellÂe. Initializing back propagation networks with proto- types. Neural Networks, 6(3):351±363, 1993.

[15] H. Derin and P. A. Kelly. Discrete-index markov-type random processes. In Pro- ceedings of IEEE, volume 77, October 1989.

[16] D. J. Field. What is the goal of sensory coding? Neural Computation, 1994. in press.

[17] R. Fletcher and C. M. Reeves. Function minimization by conjugate gradients. Com- puter Journal, (7):149±154, 1964.

[18] O. GÈallmo and L. Asplund. Reinforcement Learning by Construction of Hypoteti- cal Targets. Technical Report, Uppsala University, 1995.

[19] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1±58, 1992.

[20] D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learn- ing. Addison-Wesley, 1989.

[21] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins Uni- versity Press, second edition, 1989.

[22] G. H. Granlund. In search of a general picture processing operator. Computer Graphics and Image Processing, 8(2):155±178, 1978.

[23] G. H. Granlund. Integrated analysis-response structures for robotics systems. Re- port LiTH±ISY±I±0932, Computer Vision Laboratory, LinkÈoping University, Swe- den, 1988.

[24] G. H. Granlund. Magnitude representation of features in image analysis. In The 6th Scandinavian Conference on Image Analysis, pages 212±219, Oulu, Finland, June 1989.

[25] G. H. Granlund and H. Knutsson. Hierarchical processing of structural information in arti®cial intelligence. In Proceedings of 1982 IEEE Conference on Acoustics, Speech and Signal Processing, Paris, May 1982. IEEE. Invited Paper.

[26] G. H. Granlund and H. Knutsson. Contrast of structured and homogenous repre- sentations. In O. J. Braddick and A. C. Sleigh, editors, Physical and Biological Processing of Images, pages 282±303. Springer Verlag, Berlin, 1983. BIBLIOGRAPHY 115

[27] G. H. Granlund and H. Knutsson. Compact associative representation of visual in- formation. In Proceedings of The 10th International Conference on Pattern Recog- nition, 1990. Report LiTH±ISY±I±1091, LinkÈoping University, Sweden, 1990, also available as a reprint.

[28] V. Gullapalli. A stochastic reinforcement learning algorithm for learning real- valued functions. Neural Networks, 3:671±692, 1990.

[29] S. Hakin. Neural Networks. Macmillan College Publishing Company, 1994.

[30] D. O. Hebb. The Organization of Behavior. Wiley, New York, 1949.

[31] M. Heger. Consideration of risk in reinforcent learning. In W. W. Cohen and H. Hirsh, editors, Proceedings of the 11th International Conference on Machine Learning, pages 105±111, Brunswick, NJ, July 1994.

[32] R. Held and J. Bossom. Neonatal deprivation and adult rearrangement. Comple- mentary techniques for analyzing plastic sensory±motor coordinations. Journal of Comparative and Physiological Psychology, pages 33±37, 1961.

[33] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Com-

putation. Addison-Wesley, 1991.

= B x [34] M. R. Hestenes and W. Karush. Solutions of Ax . Journal of Research of the National Bureau of Standards, 47(6):471±478, December 1951.

[35] G. E. Hinton and S. J. Nowlan. How learning can guide evolution. Complex Sys- tems, pages 495±502, 1987.

[36] J. H. Holland. Adaptation in Natural and Arti®cial Systems. University of Michigan Press, Ann Arbor, 1975.

[37] A. S. Hornby. Oxford Advanced Learner's Dictionary of Current English. Oxford University Press, Oxford, fourth edition, 1989.

[38] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321±377, 1936.

[39] D. H. Hubel. Eye, Brain and Vision, volume 22 of Scienti®c American Library.W. H. Freeman and Company, 1988. ISBN 0±7167±5020±1.

[40] A. J. Isaksson, L. Ljung, and J-E StrÈomberg. On recursive construction of trees as models of dynamical systems. report LiTH±ISY±1246, Department of Electrical Engineering, LinkÈoping University, Sweden, 1991.

[41] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3:79±87, 1991. 116 BIBLIOGRAPHY

[42] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.

[43] M. I. Jordan and R. a. Jacobs. Hierarchies of adaptive experts. In J. Moody, S Han- son, and R. Lippman, editors, Advances in Neural Information Processing Systems, number 4, pages 985±983. Morgan Kaufmann, 1992.

[44] M. I. Jordan and R. a. Jacobs. Hierarchical mixtures of experts and the em algo- rithm. Neural Computation, 6(2):181±214, 1994.

[45] J. Kay. Feature discovery under contextual supervision using mutual information. In International Joint Conference on Neural Networks, volume 4, pages 79±84. IEEE, 1992.

[46] H. Knutsson. Filtering and Reconstruction in Image Processing. PhD thesis, LinkÈoping University, Sweden, 1982. Diss. No. 88.

[47] H. Knutsson. Producing a continuous and distance preserving 5-D vector repre- sentation of 3-D orientation. In IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management - CAPAIDM, pages 175±182, Miami Beach, Florida, November 1985. IEEE. Report LiTH±ISY± I±0843, LinkÈoping University, Sweden, 1986.

[48] H. Knutsson. Representing local structure using tensors. In The 6th Scandinavian Conference on Image Analysis, pages 244±251, Oulu, Finland, June 1989. Report LiTH±ISY±I±1019, Computer Vision Laboratory, LinkÈoping University, Sweden, 1989.

[49] H. Knutsson, M. Borga, and T. Landelius. Learning canonical correlations. Re- port LiTH-ISY-R-1761, Computer Vision Laboratory, S±581 83 LinkÈoping, Swe- den, June 1995.

[50] T. Kohonen. Correlation matrix memories. IEEE Trans.s on Computers, C-21:353± 359, 1972.

[51] T. Kohonen. Self-organized formation of topologically correct feature maps. Bio- logical Cybernetics, 43:59±69, 1982.

[52] T. Kohonen. Self-organization and Associative Memory. Springer±Verlag, Berlin, third edition, 1989.

[53] T. Landelius. Behavior representation by growing a learning tree, September 1993. Thesis No. 397, ISBN 91±7871±166±5.

[54] T. Landelius, H. Knutsson, and M. Borga. On-line singular value decomposition of stochastic process covariances. Report LiTH-ISY-R-1762, Computer Vision Lab- oratory, S±581 83 LinkÈoping, Sweden, June 1995. BIBLIOGRAPHY 117

[55] C.-C. Lee and H. R. Berenji. An intelligent controller based on approximate rea- soning and reinforcement learning. Proccedings on the IEEE Int. Symposium on Intelligent Control, pages 200±205, 1989.

[56] W. S. McCulloch and W. Pitts. A logical calculus of ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, (5):115±133, 1943.

[57] G. Mikaelian and R. Held. Two types of adaptation to an optically±rotated visual ®eld. American Journal of Psychology, (77):257±263, 1964.

[58] M. Minsky and S. Papert. Perceptrons. M.I.T. Press, Cambridge, Mass., 1969.

[59] M. L. Minsky. Computers and Thought, chapter Steps Towards Arti®cial Intelli- gence, pages 406±450. McGraw±Hill, 1963. E. A. Feigenbaum and J. Feldman, Eds.

[60] P. Munro. A dual back-propagation scheme for scalar reward learning. In Pro- ceedings of the 9th Annual Conf. of the Cognitive Science Society, pages 165±176, Seattle, WA, July 1987.

[61] K. S. Narendra and M. A. L. Thathachar. Learning automata - a survey. IEEE Trans. on Systems, Man, and Cybernetics, 4(4):323±334, 1974.

[62] K. Nordberg, G. Granlund, and H. Knutsson. Representation and learning of invari- ance. Report LiTH-ISY-I-1552,Computer Vision Laboratory, S±581 83 LinkÈoping, Sweden, 1994.

[63] E. Oja. A simpli®ed neuron model as a principal component analyzer. J. Math. Biology, 15:267±273, 1982.

[64] J. Olds and P. Milner. Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain. J. comp. physiol. psychol., 47:419±427, 1954.

[65] S. M. Omohundro. Ef®cient algorithms with neural network behavior. Complex Systems, 1:273±347, 1987.

[66] I. P. Pavlov. Selected Works. Foreign Languages Publishing House, Moscow, 1955.

[67] K. Pearson. On lines and planes of closest ®t to systems of points in space. Philo- sophical Magazine, (2):559±572, 1901.

[68] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81±106, 1986.

[69] H. Ritter, T. Martinetz, and K. Schulten. Neural Computation and Self-Organizing Maps. Addison-Wesley, 1992. 118 BIBLIOGRAPHY

[70] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D.C., 1962.

[71] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533±536, 1986.

[72] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM J. Res. Develop., 3(3):210±229, 1959.

[73] B. F. Skinner. The Behavior of Organisms: An Experimental Analysis. Prentice± Hall, Englewood Cliffs, N.J., 1938.

[74] R. E. Smith and D. E. Goldberg. Reinforcement learning with classi®er systems. Proceedings. AI, Simulation and Planning in High Autonomy Systems, 6:284±192, 1990.

[75] K. Steinbuch and U. A. W. Piske. Learning matrices and their applications. IEEE Transactions on Electronic Computers, (12):846±862, December 1963.

[76] G. W. Stewart. A bibliographical tour of the large, sparse generalized eigenvalue problem. In J. R. Bunch and D. J. Rose, editors, Sparse Matrix Computations, pages 113±130, 1976.

[77] R. S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA., 1984.

[78] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9±44, 1988.

[79] G. Tesauro. Neurogammon: a neural network backgammon playing program. In IJCNN Proceedings III, pages 33±39, 1990.

[80] E. L. Thorndike. Animal intelligence: An experimental study of the associative processes in animals. Psychological Review, 2(8), 1898. Monogr. Suppl.

[81] C. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.

[82] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavoral Sciences. PhD thesis, Harvard University, 1974.

[83] P. J. Werbos. Consistency of HDP applied to a simple reinforcement learning prob- lem. Neural Networks, 3:179±189, 1990.

[84] C-J Westelius. Focus of and Gaze Control for Robot Vision. PhD thesis, LinkÈoping University, Sweden, S±581 83 LinkÈoping, Sweden, 1995. Dissertation No 379, ISBN 91±7871±530±X. BIBLIOGRAPHY 119

[85] S. Whitehead, J. Karlsson, and J. Tenenberg. Learning multiple goal behavior via task decomposition and dynamic policy merging. In J. H. Connel and S. Mahade- van, editors, Robot Learning, chapter 3, pages 45±78. Kluwer Academic Publish- ers, 1993. [86] S. D. Whitehead and D. H. Ballard. Active perception and reinforcement learning. Proceedings of the 7th Int. Conf. on Machine Learning, pages 179±188, 1990. [87] S. D. Whitehead and D. H. Ballard. Learning to perceive and act. Technical report, Computer Science Department, University of Rochester, 1990. [88] S. D. Whitehead, R. S. Sutton, and D. H. Ballard. Advances in reinforcement learn- ing and their implications for intelligent control. Proceedings of the 5th IEEE Int. Symposium on Intelligent Control, 2:1289±1297, 1990. [89] R. J. Williams. On the use of backpropagation in associative reinforcement learn- ing. In IEEE Int. Conf. on Neural Networks, pages 263±270, July 1988. [90] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229±256, 1992. [91] R. Wilson and H. Knutsson. A multiresolution stereopsis algorithm based on the Gabor representation. In 3rd International Conference on Image Processing and Its Applications, pages 19±22, Warwick, Great Britain, July 1989. IEE. ISBN 0 85296382 3 ISSN 0537±9989. [92] R. Wilson and H. Knutsson. Seeing things. Philosophical Transactions of the Royal Society, 1994. (submitted). Run rabbit run Dig that hole, forget the sun And when at last the work is done Don't sit down it's time to dig another one ...

- Roger Waters