Kernel and Moment Based Prediction and Planning: Applications to Robotics and Natural Language Processing

CMU-RI-TR-18-13 February 19, 2018

Zita Marinho Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213

Thesis Committee:
Geoffrey J. Gordon, Chair
Siddhartha S. Srinivasa, CMU/University of Washington
André F. T. Martins, Unbabel/IT, Instituto Superior Técnico
João P. Costeira, Instituto Superior Técnico
Shay B. Cohen, University of Edinburgh
Matthew T. Mason, Carnegie Mellon University

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy. This research was supported by the Portuguese Foundation for Science and Technology under grant SFRH/BD/52015/2012.

© Zita Marinho, 2018

Keywords— Sequence Prediction, Method of Moments, Semi-supervised Learning, Predictive State Representations, Reinforcement Learning

Abstract

This thesis focuses on moment and kernel-based methods for applications in Robotics and Natural Language Processing. Kernel and moment-based learning leverage information about correlated data, allowing the design of compact representations and efficient learning algorithms. We explore kernel algorithms for planning by leveraging the inherently continuous properties of reproducing kernel Hilbert spaces. We introduce a kernel-based robot motion planner based on gradient optimization in a space of smooth trajectories, a reproducing kernel Hilbert space. We further study a kernel-based approach in the context of prediction, for learning a generative model, and in the context of planning, for learning to interact with a controlled process. Our work on moment-based learning can be decomposed into two main branches: spectral techniques and anchor-based methods. Spectral learning describes a more expressive model, which implicitly uses hidden state variables. We use it as a means to obtain a more expressive predictive model that we can use to learn to control an interactive agent, in the context of reinforcement learning. We propose a combination of predictive representations with deep reinforcement learning to produce a recurrent network that is able to learn continuous policies under partial observability. We introduce an efficient end-to-end learning algorithm that is able to maximize cumulative reward while minimizing prediction error. We apply this approach to several continuous observation and action environments. Anchor learning, on the other hand, provides an explicit form of representing state variables, by relating states to unambiguous observations. We rely on anchor-based techniques to provide a form of explicitly recovering the model parameters, in particular when states have a discrete representation, as in many Natural Language Processing tasks. This family of methods provides an easier form of integrating supervised information during the learning process. We apply anchor-based algorithms to word labelling tasks in Natural Language Processing, namely semi-supervised part-of-speech tagging, where annotations are learned from a large amount of raw text and a small annotated corpus.

Acknowledgements

I would like to thank the many people who encouraged me and contributed during the period of my PhD. First of all, I would like to thank my three academic guiding stars: André Martins, Geoff Gordon and Siddhartha Srinivasa, for their incredible support and patience in teaching me throughout the PhD. Thanks to their support and guidance I was able to learn from a rich and diverse set of research subjects. André was a very attentive and incredible advisor, and I am very grateful to him for always pointing me in the right direction; he was my NLP guiding star. I would like to thank Geoff, my guiding star, for always being present throughout these years and for his insightful advice on many research and practical problems. And lastly, I would like to thank Sidd, my Robotics guiding star, for being truly inspiring and for providing all the support I needed in the PhD. Thanks to him I was able to be part of an inspiring and thriving lab, the Personal Robotics Lab. I would like to thank Mike Koval and Anca Dragan for collaborating and working with me during the PhD; I was able to learn a lot from their advice and experience. From Geoff's side of the family, I am very grateful to Byron Boots for his guidance in the RKHS project and for teaching me about PSRs, and to Ahmed Hefny for collaborating with me on the RPSP paper; his contribution was invaluable to the completion of this work. I am also very thankful to all my other collaborators: Wen Sun, Arun Byravan, and in particular Gilwoo Lee for her work on learning PWS hybrid systems. Thank you also to Shay Cohen and Matt Mason for being part of my committee and for their feedback and discussions on the thesis topics. I would like to thank my academic grandfather Noah Smith for welcoming me in his research group and taking the time to teach my independent study at CMU. Thank you to Prof. João Paulo Costeira for his enormous support and for welcoming me in the SIPG lab, and also to all my colleagues at ISR in Lisbon: Shanghang, Qiwei, João Carvalho, João Saúde, Jayakorn, Beatriz, Sérgio, Cláudia and Manuel, for the great environment, which I already miss terribly, and also Sabina and Susana for their immense support and friendship. I would like to say a special thank you to Ada Zhang, Renato Negrinho, João Saúde and Alessandro Giordano, for being great friends and housemates. Thank you also to Priberam for hosting me in the first years of my PhD. This thesis would not have been possible without the support of the Fundação para a Ciência e Tecnologia (FCT) and the Information and Communication Technology Institute (ICTI) through the CMU/Portugal program (SFRH/BD/52015/2012). This work has also been partially supported by the European Union under H2020 project SUMMA (grant 688139), and by FCT through contract UID/EEA/50008/2013, the LearnBig project (PTDC/EEISII/7092/2014), and the GoLocal project (grant CMUPERI/TIC/0046/2014). Personally, I would like to thank all my friends and family for their incredible support through the best of times and through the worst of times. Andreia and Sofia, you have always had a sympathetic ear. Thank you Gil for your support, André Moita, Cristina and Leonardo for all your advice, and Wang Ling for your great encouragement.

I would like to dedicate this thesis to my family, to whom I am deeply grateful. And lastly, thank you Ana for being my sister; without your help none of this would have been possible.

Contents

1 Introduction
  1.1 Main contributions
  1.2 Previous Publications
  1.3 Thesis organization

2 Background
  2.1 Notation
  2.2 Models for sequential systems
  2.3 Latent Variable Models
  2.4 Latent Variable learning
  2.5 Spectral learning of sequential systems
  2.6 Planning approaches
  2.7 Planning under uncertainty

3 Sequence Labeling with Method of Moments
  3.1 Motivation
  3.2 Sequence Labeling
  3.3 Semi-Supervised Learning via Moments
  3.4 Feature-Based Emissions
  3.5 Method Improvements
  3.6 Experiments
  3.7 Conclusions

4 Planning with Kernel Methods
  4.1 Motivation
  4.2 Trajectories in Reproducing Kernel Hilbert Spaces
  4.3 Motion Planning in an RKHS
  4.4 Trajectory Efficiency as Norm Encoding in RKHS
  4.5 Kernel Metric in RKHS
  4.6 Cost Functional Analysis
  4.7 Experimental Results
  4.8 Conclusions

5 Planning with Method of Moments
  5.1 Motivation
  5.2 Predictive State Representations
  5.3 Predictive Reinforcement Learning
  5.4 Predictive State Representations of Controlled Models
  5.5 Recurrent Predictive State Policy (RPSP) Networks
  5.6 Learning Recurrent Predictive State Policies
  5.7 Connection to RNNs with LSTMs/GRUs
  5.8 Experiments
  5.9 Results
  5.10 Conclusions

6 Conclusions
  6.1 Summary of contributions
  6.2 Future directions

1 Introduction

In this chapter, we discuss the core aspects of moment and kernel-based learning and summarize the main contributions made in this thesis.

Method of Moments

Many problems in machine learning attempt to find a compact model that is capable of explaining complex behaviours from observable quantities. We focus mostly on sequential observations, where time dependence plays a crucial role in the evolution of events. Latent variable learning provides a compact representation for learning from such complex, high dimensional data in a structured form. However, learning latent variables is often difficult, in part because of the non-convexity of the likelihood function (Sun, 2014; Terwijn, 2002). Yet, the main challenge is associated with the latent (unobserved) states, which need to be estimated indirectly by looking at correlations among observations (examples in Figure 1.2). The method of moments (MoM) provides an alternative form for explaining high dimensional and complex data, based on classical statistics and probability theory. MoM dates back to Pearson's solution for curve fitting problems, i.e., for finding parameters that fit a mixture of two Gaussian distributions (Pearson, 1894). MoM estimation relies on the idea that empirical moments are "natural" estimators of population moments. In essence, learning a model via the MoM comes down to estimating model parameters that represent distributions whose moments agree with the sample moments observed in the data. Moment-based algorithms differ from likelihood-based methods in that they attempt to recover parameters by matching moments instead of maximizing likelihood, which leads to intractable optimization. MoMs are also faster to compute, but usually require more data to build good empirical estimates. Recent work applied MoM to different latent variable models, and demonstrated better convergence

properties over the most commonly applied likelihood approach, Expectation Maximization (EM), both in terms of runtime and consistency (Hsu et al., 2009). MoM also provides asymptotically consistent estimates, in the sense that the method yields the correct model parameters as the size of training sequences goes to infinity. In many applications unlabeled data is readily available, which makes moment-based algorithms a preferable choice in unsupervised settings, or even when labeled data is scarce. These methods yield (asymptotically) statistically optimal solutions, and can be computed efficiently. MoM's intuitive construction, together with its statistical consistency guarantees, makes this class of estimators an increasingly attractive choice in a wide range of research areas including: System Identification (Van Overschee et al., 1996), Video Modeling (Blaschko et al., 2008), Robotics (Siddiqi et al., 2007), and Natural Language Processing (Cohen et al., 2012; Parikh et al., 2011; Balle et al., 2011). Since Pearson's work, MoM has been studied and adapted for a variety of problems. "Identifiability" was proven to be theoretically possible for mixtures of Gaussian distributions—where two mixtures of different Gaussians have distinct probability distributions (Teicher, 1961). This property ensures the parameters estimated via MoM will be uniquely identified (Chang, 1996). Over the last decade, the machine learning community has contributed new theoretical advances in MoM estimation: starting with Dasgupta et al. (2000), Arora et al. (2001) and Vempala et al. (2004), who designed polynomial time algorithms based on moment estimation. Later, Hsu et al. (2009) provided sample complexity guarantees for different hidden variable models, Anandkumar et al. (2012a) provided an algorithm to learn general hidden variable models, and Anandkumar et al. (2012b) and Arora et al. (2013) contributed computationally efficient learning methods. Some of these methods rely at their core on a spectral decomposition of the observed statistics—moment matrices or tensors—and are thus commonly denoted as "spectral methods". MoM, or alternatively spectral learning, has been widely used to learn challenging models including hidden Markov models (Hsu et al., 2009), latent trees (Parikh et al., 2014), latent junction trees (Parikh et al., 2012a), probabilistic context free grammars (Cohen et al., 2012; Parikh et al., 2011), mixture models (Anandkumar et al., 2014), finite state machines (Luque et al., 2012; Balle et al., 2011) and predictive state representations (Littman et al., 2002; Rosencrantz et al., 2004a). A challenge specific to spectral methods involves estimating probabilities from spectral decompositions. MoMs do not restrict, in general, the parameters to be non-negative (the non-negativity problem), leading to violations of the probability constraints. For moment-based methods, an analogous problem arises when we extract parameters that match

empirical moments: one of the main difficulties is related to statistical noise, and resides in undoing the linear transformation such that the recovered parameters are constrained to a feasible set, either their marginal polytope (the set of realizable marginal probabilities (Wainwright, 2003)) or the probability simplex. An alternative interpretation of this issue lies in the fact that MoMs rely on an infinite data assumption, i.e., the knowledge of exact moments to correctly retrieve the model parameters. On top of that, in practice, even for large amounts of data, these methods suffer from statistical noise in the empirical estimates of the moments, leading to suboptimal rates of convergence (the statistical efficiency problem). Another problem with moment-based methods is that in order to construct a set of moments that makes the parameter estimation problem well conditioned, we may need to move to a harder class of computational problems: for example, we may prefer to observe a large sparse matrix of second moments instead of a small dense one. In this case the matrix factorization problem becomes a matrix completion problem, which is computationally more difficult (Balle et al., 2012a; Bailly et al., 2013a). The challenges above stem in part from the fact that it is common for MoM estimators to introduce an extra transformation (often linear) on top of the traditional parameter representation, and it is difficult to undo this transformation. From a practical point of view, moment-based algorithms have not yet succeeded in many domains as a viable and more effective alternative to likelihood-based methods. Prior work has been done in the direction of finding heuristics to improve results. Cohen et al. (2013b) and Luque et al. (2012) provided improvements in terms of accuracy and runtime, but these have mostly focused on how to handle problems inherent to spectral learning, such as negative probabilities (the non-negativity problem) and limited data (breaking the infinite data assumption these methods require). Nevertheless, there is still room for experimental improvement, especially when compared with local search methods such as EM when the observation space is large. In this thesis, we focus on the MoM for estimating parameters of sequential models such as hidden Markov models and predictive state representations. Next, we provide an overview of how the MoM can be applied to structured prediction tasks. We further focus on how the MoM and kernel approaches can be applied in the context of Natural Language Processing and Robotics.
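As a concrete illustration of moment matching (a textbook example, not a method from this thesis), the following minimal sketch estimates the parameters of a Gamma distribution by equating population moments with sample moments:

```python
import numpy as np

# Classical method of moments: equate population moments with sample
# moments and solve for the parameters. For X ~ Gamma(shape k, scale s):
#   E[X] = k * s,  Var[X] = k * s^2  =>  k = E[X]^2 / Var[X],  s = Var[X] / E[X].

def gamma_mom(samples):
    mean, var = samples.mean(), samples.var()
    return mean**2 / var, var / mean  # (shape, scale) estimates

rng = np.random.default_rng(0)
data = rng.gamma(shape=3.0, scale=2.0, size=100_000)
print(gamma_mom(data))  # approaches (3.0, 2.0) as the sample size grows
```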

Structured Prediction with Method of Moments

Structured prediction is a general machine learning problem that attempts to uncover the hidden structure of data. Structural elements can be found both in Natural Language Processing (NLP) and Robotics

in different forms, such as parse trees and input/output sequence models in NLP, or sequential hybrid models in Robotics. Output prediction can be regarded as a labeling problem for the sequence of inputs. In Natural Language Processing several tasks can be framed in this setting, such as part-of-speech tagging, named entity recognition, information extraction, and syntactic disambiguation (Manning et al., 1999). Conversely, in Robotics there is also a large number of tasks that can be posed as an input/output prediction problem, such as mode identification (Lauer et al., 2008; Paoletti et al., 2007) and grasp classification (Cutkosky et al., 1990). In this thesis, we deal with two types of structured prediction: sequence completion and sequence labeling. Sequence completion consists of predicting a sequence of events based on preceding elements. Here, the order of the elements is crucial to learning a correct model of the internal representation, and different algorithms exist for different types of problems. In this thesis, we consider both discrete and continuous sequential data, i.e., the domain of the elements of each sequence can be discrete elements in a finite set or discrete-time observations of a continuous valued process. In NLP, we typically deal with the former, and in Robotics we generally address the latter. In both cases, the prediction task can be cast as a more general learning problem of a high level model that is able to predict future elements of an output sequence of events, based on an input sequence, as depicted in Figure 1.1.

In standard sequence completion problems, the output sequence corresponds to the consecutive element or elements of the sequence, as in most common sequence prediction problems: weather forecasting, stock market forecasting, language modeling, and robot pose prediction in robotic manipulation. In the more general sequence labeling tasks, the output has a less restrictive interpretation.

[Figure 1.1: Sequence prediction task with input i_t/output o_t sequences (in gray). The process is modeled by a finite state machine with states z_t (in white).]

Method of moments for Robotics and Natural Language Processing

The learning of predictive models can be done with existing input/output sequence pairs, in a supervised setting; relying only on input sequences, in an unsupervised setting; or using a combination of input/output pairs with typically less expensive unlabeled input sequences, also referred to as semi-supervised learning. MoM or spectral learning algorithms are often associated with unsupervised learning, where no information about the output sequence is available; more specifically, when we are interested in augmenting the states as refinement layers, as in latent-variable learning. In certain models, such as hidden Markov models (HMMs) or predictive state representations (PSRs), the sequence can be considered together with a sequence of hidden state variables that serve as additional information about the input sequence, increasing the expressiveness of the model.

In many applications, it is not necessary to explicitly know this hidden space representation, and tasks such as filtering and prediction can be done effectively without it. Many Robotics/NLP applications fall under this category, such as localization and observation prediction in a dynamical system, language modeling, and model refinement. We group this latter form of learning together with spectral learning techniques. However, in many applications we are interested in an explicit representation of the hidden states. Whenever we grant hidden states meaningful representations, we need to explicitly ascertain their probability distribution. Many NLP tasks require this property, such as part-of-speech tagging, named entity recognition, and all structured prediction tasks where the states represent labels. We can encode this information using different approaches, namely anchor techniques (Arora et al., 2013). In this thesis, we will investigate both directions. In the first direction, we learn model parameters in an interpretable setting, and explicitly recover the hidden state distribution— anchor learning. In the second, we ignore the hidden state interpretation and are only interested in predicting future observations conditioned on past observations, by means of a more expressive model— spectral learning.

[Figure 1.2: Sequence prediction task with output o_t sequences (in gray). The process is modeled by a finite state machine with states z_t (in white).]

Kernel-based learning

Kernel methods are a common approach in Machine Learning. They provide a compact representation of high dimensional, non-linear data by embedding data from an input set X into a feature space Φ. The kernel function provides the means to reason about distances between points in this space, via feature inner products k(x, y) = ⟨φ(x), φ(y)⟩. This brings forth the advantages of linear models, such as computational efficiency, without losing the expressivity of non-linear models (Scholkopf et al., 2001). Additionally, a Gram matrix formulation of this mapping is often used to circumvent the need to represent high dimensional feature mappings. However, in many cases this formulation is computationally prohibitive, since one needs to compute an N × N Gram matrix of pairwise kernel evaluations for a dataset of N samples. This limits the applicability of kernel methods to large training datasets. On the other hand, randomized feature representations trade off accuracy of the kernel expansion for scalability, by mapping data into a low-dimensional randomized feature space, greatly reducing the computation of kernel evaluations and training (Rahimi et al., 2008; Sinha et al., 2016).
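As an illustration of this trade-off, the following minimal sketch of random Fourier features (Rahimi et al., 2008) approximates a Gaussian RBF kernel so that inner products of D-dimensional random features replace entries of the exact N × N Gram matrix (the feature dimension D and bandwidth gamma are arbitrary choices, not values from the thesis):

```python
import numpy as np

def random_fourier_features(X, D=500, gamma=1.0, seed=0):
    """Map X (n x d) to D random features whose inner products
    approximate the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, D))  # spectral density of the RBF kernel
    b = rng.uniform(0, 2 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.random.default_rng(1).normal(size=(200, 5))
Z = random_fourier_features(X)
approx_gram = Z @ Z.T  # O(N * D) work instead of N^2 exact kernel evaluations
```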

More recently, efficient learning of deep architectures (Orr et al., 1998; Hinton et al., 2006) gave rise to an efficient and flexible family of algorithms that have proven very successful in many research communities. Carefully designed regularization techniques to control the representational power of deep networks allowed the learning of compact representations (Prechelt, 1997; Srivastava et al., 2014; Ioffe et al., 2015). However, in contrast to kernel approaches, these techniques do not benefit from a well understood theoretical analysis (Bengio, 2009). In this thesis, we focus on reproducing kernel Hilbert spaces (RKHSs), a family of kernel spaces with additional structure on the feature space (Aronszajn, 1950; Wahba, 1999). RKHSs provide a good trade off between expressivity and tractability. They also benefit from an extensive theoretical analysis of convergence (Schölkopf et al., 2002). But the main advantage of RKHSs lies in the ability to evaluate a continuous function f at a given point x by taking the inner product of f with the reproducing kernel feature map at that point, k(·, x). This is not true, in general, for an arbitrary Hilbert space (Riesz, 1928; Conway, 1985). RKHSs may equivalently be defined as Hilbert spaces of functions (spaces endowed with an inner product) in which function evaluation is a continuous and linear operation. The interpretation of function evaluation as an inner product in the RKHS is the core property that grants rich enough structure to design compact and efficient algorithms. Optimization problems, such as taking the gradient of any function f in the RKHS with respect to f, may be translated into reproducing kernel variants in terms of the inner product ⟨f, k(·, x)⟩, since the gradient of the evaluation functional assumes the form k(·, x). This allows us to describe any optimization algorithm of the form min_θ f_θ(x) in RKHS terms. RKHSs have been successfully applied in many fields such as Statistics (Parzen, 1962; Parzen et al., 1970) and Machine Learning (Manton et al., 2015; Scholkopf et al., 2001). In this thesis, we will focus on prediction techniques for sequential Robotics problems.

Thesis Statement

The core of this thesis consists of improving kernel and moment-based methods for learning sequence prediction and planning. We validate the proposed methods on a varying range of tasks, such as labeling problems in Natural Language Processing, and trajectory optimization and policy learning in Robotics.

We focus on planning algorithms for Robotic applications, with emphasis on optimization based planning. We propose a novel planning algorithm for optimizing robot trajectories in high dimensional spaces using kernel methods. Furthermore, we investigate planning

under uncertainty by combining MoM and reinforcement learning algorithms. We study variants of PSRs with continuous observations for more complex robotics tasks, such as robotic manipulation and planning in a partially observable setting. Additionally, we study other moment matching techniques that provide easier forms of interpreting hidden state variables, as well as subsuming supervised information, where we intend to design algorithms that are capable of handling large amounts of data in a weakly supervised setting, are flexible, and allow the inclusion of model constraints. In this regard we introduce sequential prediction for discrete representations with anchoring observations.

1.1 Main contributions

This thesis makes the following contributions:

Contribution 1: Interpretable representation of states

In many applications, it is often beneficial to build models with meaningful representations, such as in sequence labeling tasks, where the hidden variables represent labels in a finite set. Many tasks in Natural Language Processing and Robotics fall under this category, in particular those pertaining to sequence labeling, where the hidden states represent part-of-speech tags, named entities, or parsing constituents. It is, therefore, important to explicitly recover a mapping between hidden states and labels. This remains one of the main challenges in MoM estimation. Due to the nature of the problem, it is difficult to find an interpretable representation of the hidden structure. Usually, spectral methods marginalize over the hidden variables, without revealing their state representations. There has been prior effort in finding the explicit model parameters, e.g., via the tensor power method (Arora et al., 2013) and simultaneous diagonalization (Kuleshov et al., 2015), both of which address learning a similarity transform between hidden states and labels. However, these methods are very sensitive to model mismatch and to statistical noise. We build upon an alternative form of learning that relies on anchor observations, i.e., unambiguous observations that uniquely map to a state variable. This idea was first derived for finding topic distributions in topic modeling (Arora et al., 2013). We introduce an anchor-based algorithm for learning sequence labels using the MoM. By taking advantage of the anchor idea we introduce a novel algorithm that learns interpretable representations of state in a hidden Markov model, in Chapter 3.
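As a toy illustration of the anchor concept only (the counts and the threshold below are invented, and the actual algorithm of Chapter 3 works from moments rather than labeled counts), an anchor is an observation whose probability mass concentrates on a single hidden state:

```python
import numpy as np

# Toy illustration: an "anchor" word is one that (almost) only occurs
# under a single tag, so observing it disambiguates the hidden state.
# Rows: tags (hidden states); columns: words (observations).
counts = np.array([
    [90,  0, 40],   # tag 0
    [ 0, 80, 35],   # tag 1
])
p_tag_given_word = counts / counts.sum(axis=0, keepdims=True)

anchors = {}
for word in range(counts.shape[1]):
    tag = p_tag_given_word[:, word].argmax()
    if p_tag_given_word[tag, word] > 0.99:  # (near-)unambiguous word
        anchors.setdefault(tag, word)
print(anchors)  # {0: 0, 1: 1}; word 2 occurs under both tags, hence not an anchor
```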

Contribution 2: Integrate supervision with moment-based estimation

It is often desirable to restrict models to retain certain labeled information, either in the form of prior distributions or constraints over model parameters. For this purpose we require an explicit mapping into the hidden state representation. Finding model parameters that agree with given moment constraints and conform with supervised information is not a straightforward task, in particular for methods that rely on spectral decompositions of observations. The main problem is that most moment-based methods describe the probability of observations as a series of operator multiplications, without providing an explicit representation for hidden states, since this is a harder problem. They simply perform inference by propagating the effect through the hidden structure implicitly. This renders integration of prior knowledge hard. For instance, estimated operators, also denoted as predictive state representations (PSRs) (Littman et al., 2002; Singh et al., 2004b) or observable operator models (OOMs) (Jaeger, 2000), can be obtained by first expressing moments in a lower dimensional space that correlates well with past observations, and second by regressing the predicted future estimates from the past. This process does not explicitly involve the model parameters, being derived only from observed data. Another line of moment methods addresses this challenge by modeling sequential tasks as input-output finite state machines, but still keeping the hidden state space unknown (Balle et al., 2011). Here inputs relate to observations and outputs to labels. However, this approach requires knowledge of the labels during training, which makes it unsuitable for weak supervision on large datasets. We provide a learning algorithm based on moment estimation that is able to incorporate supervised information in different forms. First, we introduce a weakly-supervised learning approach that is able to learn from a small labeled and a large unlabeled dataset. Second, we introduce an additional source of supervision by means of supervised regularization, in Section 3.6.

Contribution 3: Feature based anchor learning

Sequence models with more expressive representations have long been used in Machine Learning. These allow for more complete representations, such as low dimensional embeddings or features of observations. In particular, in the Natural Language Processing community it is common to consider sequential models with log-linear emissions, also known as exponential family distributions. Smith et al. (2005), Lafferty et al. (2001), and McCallum et al. (2000) considered such feature-based representations of observations in a discriminative

setting, and Berg-Kirkpatrick et al. (2010) generalized this notion to generative models. We introduce a moment-based anchor learning algorithm that is able to handle continuous representations using HMMs with log-linear emissions. We apply this method to a Natural Language Processing task, part-of-speech tagging (§ 3.4).

Contribution 4: Combine kernel approaches for gradient based planning

Functional gradient algorithms are a popular choice for robot motion planning in complex high dimensional environments. Optimization based planning attempts to directly improve the trajectory of a robot within a space of continuous trajectories. The objective is to reach a goal position without colliding with obstacles while maintaining geometric properties such as smoothness. In practice, standard planning implementations such as CHOMP (Zucker et al., 2013) and TrajOpt (Schulman et al., 2013) typically commit to a fixed, finite parametrization of trajectories, often as a sequence of waypoints. We introduce a functional gradient descent trajectory optimization algorithm for robot motion planning in Reproducing Kernel Hilbert Spaces (RKHSs), in Chapter 4. We represent trajectories as functions in an RKHS (§ 4.2). Restricting trajectories to a space of reproducing kernels provides a convenient representation that naturally gives rise to a notion of smoothness in terms of the norm induced by the reproducing kernel. In consequence, depending on the selection of kernel, we can directly optimize in spaces of trajectories/functions that are inherently smooth in velocity, jerk, curvature, or any meaningful efficiency norm (§ 4.4).
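A minimal illustration of this representation (a toy 1-D sketch; the kernel width, cost gradient, and step size are made-up placeholders, not the algorithm of Chapter 4): a trajectory is a finite kernel expansion ξ(t) = Σ_i β_i k(t_i, t), and a functional gradient step simply adds one more kernel function centered where the cost acts.

```python
import numpy as np

def rbf(t, c, width=0.1):
    """Gaussian RBF kernel k(c, t); the width is an illustrative choice."""
    return np.exp(-(t - c) ** 2 / (2 * width ** 2))

class RKHSTrajectory:
    """1-D trajectory xi(t) = sum_i beta_i * k(t_i, t) in an RKHS."""
    def __init__(self):
        self.centers, self.betas = [], []

    def __call__(self, t):
        return sum(b * rbf(t, c) for b, c in zip(self.betas, self.centers))

    def functional_gradient_step(self, t_obs, grad_cost, step=0.5):
        # The functional gradient of an evaluation-based cost at time t_obs
        # is grad_cost * k(t_obs, .), so the update stays a kernel expansion.
        self.centers.append(t_obs)
        self.betas.append(-step * grad_cost)

xi = RKHSTrajectory()
xi.functional_gradient_step(t_obs=0.3, grad_cost=2.0)  # react to a cost "seen" at t = 0.3
print(xi(0.3), xi(0.9))  # the update is smooth and concentrated near t = 0.3
```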

Contribution 5: Design an efficient algorithm including kinematic and dynamic robot constraints

Common optimization based strategies work (in theory) by directly optimizing within a space of continuous trajectories to avoid obstacles while maintaining smoothness. However, in practice, standard planning implementations represent trajectories as a sequence of finely discretized waypoints. Such a parameterization can lose much of the benefit of reasoning in a continuous trajectory space: e.g., it can require taking an inconveniently small step size and a large number of iterations to maintain smoothness. Trajectory optimization in RKHSs generalizes functional gradient trajectory optimization by representing trajectories as linear combinations of kernel functions. As a result, we are able to take larger steps and achieve a locally optimal trajectory in just a few iterations. The selection of an appropriate kernel

can reduce the complexity to a low-dimensional, adaptively chosen parameterization. We propose an efficient gradient-based algorithm that is able to handle robot constraints, such as boundary conditions, joint limits and velocity limits (§ 4.3). Our experiments illustrate the effectiveness of the planner for different kernels, including Gaussian RBFs with independent and coupled interactions among robot joints, Laplacian RBFs, and B-splines (§ 4.7).

Contribution 6: Combine modeling and planning

Next we focus on how to integrate predictive models to improve existing planning algorithms. Once a model of the environment is learned, how can we act upon it to achieve a desired goal? We refer to planning as the mechanism that determines a sequence of actions maximizing a given objective function, typically the sum of expected returns obtained by a learning agent in an interactive environment. Little work has been done on combining predictive representations with planning, and it is mostly restricted to discrete or blind settings. We introduce a new class of policies that is able to leverage the benefits of having a predictive model within a recurrent architecture that maps predictive states to actions. We introduce a novel architecture that is designed to learn this mapping in a partially observable environment; we denote it the Recurrent Predictive State Policy (RPSP) network, in Chapter 5. RPSPs consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions. The recursive filter uses predictive state representations (PSRs) (Rosencrantz et al., 2004b; Sun et al., 2016a) by modeling the state as a predictive state, i.e., a prediction of the distribution of future observations conditioned on history and future actions. Therefore, the policy component of the RPSP-network can be purely reactive, simplifying training while still allowing optimal behaviour. A reactive policy mapping from predictive states to actions has rich information for decision making, as it can reason about the entire history of the controlled process.
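The following schematic sketch is a stand-in for this architecture (the linear filter and policy, the dimensions, and the random observations are all hypothetical; the actual RPSP filter in Chapter 5 is a PSR update initialized by moment matching):

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_obs, d_act = 8, 4, 2

# Hypothetical linear stand-ins for the two RPSP components:
# a recursive filter q_{t+1} = f(q_t, a_t, o_t) and a reactive policy a_t = pi(q_t).
W_filter = rng.normal(size=(d_state, d_state + d_act + d_obs)) * 0.1
W_policy = rng.normal(size=(d_act, d_state)) * 0.1

def filter_step(q, a, o):
    # Predictive-state update: in an RPSP this is a PSR filtering equation.
    return np.tanh(W_filter @ np.concatenate([q, a, o]))

def policy(q):
    # Reactive policy: maps the current predictive state directly to an action.
    return W_policy @ q

q = np.zeros(d_state)           # initial belief (in RPSPs, set by moment matching)
for t in range(5):
    a = policy(q)
    o = rng.normal(size=d_obs)  # observation returned by the environment
    q = filter_step(q, a, o)
```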

Contribution 7: Filter initialization

Recurrent networks with memory, such as GRUs and LSTMs, can leverage information about previous observations and actions to keep track of a belief over the state of the environment. RNNs may exploit heuristics for improved initialization, such as noisy initial values (Zimmermann et al., 2012); however, such a strategy does not guarantee good performance. We propose an initialization procedure for RPSP recurrent filters that provides good theoretical guarantees in the limit of infinite data, in Chapter 5. RPSP filters correspond to PSR models; as a result they can make use of history like LSTMs/GRUs, and track in the RKHS of distributions of future

observations based on past histories like PSRs. For this reason, RPSPs have a statistically driven form of initialization, which can be obtained using moment matching techniques, with statistically consistent algorithms (Hefny et al., 2017b; Boots et al., 2013a).

Contribution 8: End-to-end algorithm with prediction as regularization

RPSPs define computation graphs, where the parameters are optimized by leveraging the states of the system. Predictive states encode rich enough information about the partially observable environment, with the additional benefit of having a clear interpretation as predictions of future observations, and of being trained based on that interpretation. We make use of the PSR interpretation to formulate an end-to-end training algorithm, by incorporating prediction error in the loss function. The entire network (recursive filter and reactive policy) is differentiable and can be trained using gradient based methods, in Section 5.6. We optimize RPSPs using a combination of policy gradient based on rewards (Williams, 1992) and gradient descent based on prediction error. We show the efficacy of RPSP-networks on a set of robotic control tasks from OpenAI Gym. We empirically show that RPSP-networks perform well compared with memory-preserving networks such as GRUs, as well as finite memory models, being the overall best performing method (§ 5.8).
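A minimal sketch of such a joint objective (an assumption-laden illustration: the REINFORCE-style surrogate, the squared prediction error, and the weight `alpha` are placeholders, not the exact losses of Section 5.6):

```python
import torch

def rpsp_loss(log_probs, returns, predicted_obs, observations, alpha=0.5):
    """Joint objective: a policy-gradient surrogate plus a prediction-error
    term that regularizes the recursive filter."""
    policy_loss = -(log_probs * returns).mean()             # REINFORCE-style term (Williams, 1992)
    prediction_loss = ((predicted_obs - observations) ** 2).mean()
    return policy_loss + alpha * prediction_loss

# Both components are differentiable, so a single backward pass trains the
# filter and the reactive policy end to end:
#   rpsp_loss(...).backward(); optimizer.step()
```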

1.2 Previous Publications

During the course of this doctoral study, further work was developed in collaboration with other authors; it does not constitute first author contributions and is thus omitted from this thesis. This includes work in the field of sequence prediction with mode switching, where we combine Gaussian process regression with particle filters to improve prediction in the vicinity of mode transitions (Lee et al., 2017); and work in the area of predictive planning, where we introduce predictive representations of controlled processes as a filtering mechanism in partially observable environments (Hefny et al., 2017a).

1.3 Thesis organization

This thesis is organized as follows:

Part II: Background

This part summarizes existing work on the method of moments, providing the necessary background required to understand the main

contributions of this thesis. We provide an overview of the method of moments literature for prediction and planning in the following sections:

• Section 2.1 provides an overview of the notation used throughout the thesis.

• Section 2.3 describes the most common latent variable models used in moment estimation, and Section 2.4 describes learning techniques that find model parameters by explicitly recovering them.

• Section 2.5 describes spectral learning models for sequence prediction, including HMMs, OOMs, FSTs and PSRs, with their learning algorithms in §2.5.6.

• Section 2.6 provides an overview of planning algorithms mostly applied to Robotics.

• Section 2.7 provides an overview of planning under uncertainty and reinforcement learning algorithms.

Part III: Sequence Labeling with Method of Moments

This part of the thesis presents contributions made in the field of sequence labeling using the method of moments, in Chapter 3. We provide a description in the following order:

• Section 3.3 introduces a novel algorithm for sequence labeling using anchor-based learning, with an extended notion of anchors.

• Section 3.4 introduces a feature-based extension of anchor learning for hidden Markov models (HMMs).

• Section 3.6 presents an empirical analysis of anchor-based learning for HMMs and log-linear models.

Part IV: Planning with Kernel Methods

This part of the thesis presents contributions made in the field of Robot Motion Planning using kernel approaches, in Chapter 4. We provide a description in the following order:

• Section 4.2 introduces an approach for representing trajectories as functions in a reproducing kernel Hilbert space (RKHS).

• Section 4.3 introduces a constrained optimization algorithm based on gradient descent methods in RKHSs.

• Section 4.7 presents an empirical analysis of trajectory optimization using distinct variants of RKHSs, in simulated toy and more complex robotic environments.

Part V: Planning with Method of Moments

This part of the thesis describes how we can combine predictive models with existing planning algorithms based on reinforcement learning, in Chapter 5.

• Section 5.5 proposes a new class of policies combining moment- based predictive models with reactive policies.

• Section 5.6 combines controllable predictive state models with planning algorithms, in a reinforcement learning approach. This section presents an end-to-end algorithm for continuous actions based on predictive representations under partial observability.

• Section 5.8 provides an experimental analysis of learning RPSP networks through gradient descent, in joint and alternating approaches, using OpenAI Gym simulated environments.

Part VI: Conclusions

This chapter highlights the main conclusions and findings of this thesis and provides possible directions for future work.

2 Background

In this chapter, we introduce related work and background required for a comprehensive understanding of this thesis. We provide an overview of sequential models (§ 2.2), introduce some latent variable models (§ 2.3) and learning algorithms (§ 2.4). We discuss sequential models for spectral learning (§ 2.5). We review some background on planning (§ 2.6), and in particular on reinforcement learning (§ 2.7).

2.1 Notation

In this thesis, we will define vectors in bold notation as column vectors v ∈ R^{n×1}, where [v]_i denotes the entry indexed by the i-th component of the vector v. We write matrices as capital letters M, with M_{i,j} denoting the entry in the i-th row and j-th column of the matrix. We further denote by R_+ the set of real non-negative values and by R_{++} the set of strictly positive values. We say v belongs to the probability simplex ∆^{d−1} such that:

∆^{d−1} = { v ∈ R^d : v_i ∈ [0, 1] ∀i ∈ [d], ∑_{i=1}^{d} v_i = 1 }    (2.1)

We refer to real valued functions f : Ξ → R as points in a function space F by keeping the argument without assignment: f(·) ∈ F represents a function, i.e., an element of the space, while f(0) ∈ R refers to an evaluation of the function at the point 0. We use ⊗ to denote the outer product: for vectors x ∈ R^X and y ∈ R^Y it is equivalent to x ⊗ y = x y^⊤ ∈ R^{X×Y}. The symbol ⊙ is used to denote the element-wise product of vectors and matrices of the same dimension. Furthermore, we define the tensor product of any arbitrary p-order tensor T ∈ R^{n_1×...×n_p} as a multilinear mapping, where ×_i refers to multiplication along mode i. Let the vectors v_j ∈ R^{n_j} and the matrices V_j ∈ R^{n_j×d_j}, ∀j ∈ [p]. The projection of the matrices on each mode of the tensor yields

T(V_1, ..., V_p) = T ×_1 V_1 ×_2 V_2 ... ×_p V_p ∈ R^{d_1×d_2×...×d_p}    (2.2)

When the tensor is a second-order tensor, i.e., a matrix, the multilinear map reduces to the usual matrix and vector multiplication: T(V_1, V_2) = V_1^⊤ T V_2 and T(I, v) = T v.
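As a concrete illustration of this notation (not from the thesis), the following numpy sketch realizes the mode products of Eq. 2.2 for the matrix case via einsum:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(size=(3, 4))        # a second-order tensor (matrix)
V1 = rng.normal(size=(3, 2))
V2 = rng.normal(size=(4, 5))
v = rng.normal(size=4)

# T(V1, V2) = T x_1 V1 x_2 V2: contract each mode with the matching matrix.
TV = np.einsum('ij,ia,jb->ab', T, V1, V2)
assert np.allclose(TV, V1.T @ T @ V2)            # matrix case of Eq. 2.2

# T(I, v): multiplying mode 2 by a vector reduces the order by one.
assert np.allclose(np.einsum('ij,j->i', T, v), T @ v)
```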

2.2 Models for sequential systems

In this chapter, we review some existing learning algorithms for sequential systems, where a sequence of observable events o_i ∈ Σ is determined by an underlying stochastic process Z, see Figure 2.1. Depending on the application, observations may correspond to continuous values, as in many Robotics problems, or discrete values, such as words in a dictionary in Natural Language Processing (NLP).

[Figure 2.1: Observations o_i (bottom), generated from a hidden process z_j (top).]

In the following sections we review sequential models for stochastic processes under different assumptions, described in Figure 2.2, using the method of moments (MoM), see Section 2.4.2. An important characteristic of moment-based learning is that it provides consistent parameter estimates, under certain assumptions on the singular values of the model. It provides an alternative to a different class of learning methods, based on maximum likelihood estimation (MLE), in particular local methods such as the popular Expectation-Maximization (EM) algorithm (Dempster et al., 1977).

[Figure 2.2: Sequential models overview: explicit state mapping, kernel based, and controllable processes within the space of stochastic processes (HMMs, HSE-HMMs, POMDPs, HSE-PSRs, OOMs, IO-OOMs, PSRs). Models with interpretable view—latent variable models (gray). Non-linear sequential models (orange). Linear sequential models (blue).]

MLE provides a statistically efficient estimation paradigm but is intractable to solve, due to the non-convexity of the likelihood function. Local methods, such as EM, turn this problem into a tractable one, but suffer from local optima and slow convergence (Redner et al., 1984), see Section 2.4.1. In fact, for some models, such as hidden Markov models (HMMs) (Rabiner, 1989), latent trees (Mossel et al., 2006) and topic models (Arora et al., 2012), MLE is NP-hard. MoM, on the other hand, does not suffer from local optima, even in situations where maximum likelihood does. MoM exploits algebraic properties of the model to build factorizations of moments into model parameter estimates. Moment-based techniques can be divided into two groups. The first involves performing some spectral decomposition and is usually denoted as spectral learning, described in more detail in Section 2.5.6. The second refers to moment-matching techniques that explicitly model the hidden states; these are mostly known as anchor methods, see Section 2.4.4. A large group of theoretical computer scientists has studied the computational and sample complexity of estimating certain latent variable models such as Gaussian mixture models (Dasgupta, 1999; Arora et al., 2001; Dasgupta et al., 2007; Vempala et al., 2004; Kannan et al., 2008; Chaudhuri et al., 2008; Brubaker et al., 2008; Belkin et al., 2010; Moitra et al., 2010) and HMMs (Hsu et al., 2012a; Chang, 1996; Mossel et al., 2006; Anandkumar et al., 2012a).

In Section 2.3, we formally describe models where data is generated by a stochastic process and where we explicitly identify the hidden state variables. Furthermore, in Section 2.4 we provide an overview of learning algorithms based on MoM that estimate the parameters through an explicit mapping of the hidden state variables, which we denote latent variable learning. This explicit learning variant recovers the model parameters (conditionals M and marginals Z) from moments of observations W. These parameters can be combined in the form of operators, also known as observable operators, that allow us to describe probabilities of sequences of observations, as in Figure 2.3a. A different line of work, based on subspace identification (Van Overschee et al., 1996), observable operator models (OOMs) (Jaeger, 2000) and multiplicity automata (Schützenberger, 1961), has been proposed as an alternative to latent variable models. In particular, Hsu et al. (2009) and Rodu et al. (2013) applied spectral learning to HMMs with finite sample bounds. Further work developed a more general class of models to capture the evolution of sequential systems without explicitly recovering the hidden variables, such as reduced rank HMMs (Siddiqi et al., 2010b), kernel HMMs (Song et al., 2010b), predictive state representations (PSRs) (Littman et al., 2002; Singh et al., 2003), latent tree graphical models (Parikh et al., 2012a), weighted automata (Balle et al., 2012b; Bailly et al., 2013a) and probabilistic context-free grammars (PCFGs) (Cohen et al., 2013b; Dhillon et al., 2011). These methods rely on low order moments to estimate an operator that embeds the distribution of a sufficient statistic of the model. These operators can be used for density estimation and belief state updates, very commonly used in robotics and language generation. In Table 2.1 we summarize the main literature on moment-based approaches for sequential systems, emphasizing the main characteristics of each approach. In Section 2.5, we review learning approaches for sequential models where there is no explicit mapping to the hidden state space, which we denote spectral learning. In this variant, we are not interested in recovering the hidden structure, but instead model the generative process, while being able to perform prediction over sequences of observations.

| Models    | Explicit state | Kernel based | Controllable systems | Time variant | Sec.  | Refs. |
|-----------|----------------|--------------|----------------------|--------------|-------|-------|
| HMMs      | X              | -            | -                    | -            | 2.3.1 | Hsu 09; Anandkumar 12; Anandkumar 14; Chaganty 14; Kuleshov 15; Siddiqi 10 |
| anchorHMM | X              | -            | -                    | -            | 3     | Marinho 16 |
| HSE-HMMs  | X              | X            | -                    | -            | 2.5.5 | Song 10; Smola 07 |
| POMDPs    | X              | -            | X                    | -            | -     | Azizzadenesheli 16 |
| OOMs      | -              | -            | -                    | -            | -     | Jaeger 99 |
| IO-OOMs   | -              | -            | X                    | -            | 2.5.3 | Jaeger 00; Thon 15 |
| WFA       | -              | -            | -                    | -            | 2.5.2 | Balle 12; Bailly 13 |
| PCFGs     | -              | -            | -                    | -            | -     | Cohen 13; Dhillon 11 |
| PSRs      | -              | -            | X                    | -            | 2.5.4 | Littman 01; Singh 03; Rosencrantz 04; Boots 11 |
| HSE-PSRs  | -              | X            | X                    | -            | 2.5.6 | Boots 13; Hefny 15 |
| PSIMs     | -              | X            | -                    | X            | -     | Sun 16 |

Table 2.1: Summary of moment-based approaches for prediction in sequential systems.

In spectral learning we directly estimate observable operators, from which we can describe the sequential process, as in Figure 2.3b. We provide a general description of sequential models and bring together different models proposed for controlled and uncontrolled sequential systems. In Section 2.5.6, we consider learning of controlled systems, where we need to account for the effect of actions that an agent, either a robot or a state machine, can perform. The added challenge in learning controlled systems lies in the fact that the actions performed by the agent change the distribution of future observations. The first class of methods expresses a simpler model but is able to extract the hidden state parameters, which can be important depending on the task at hand. Conversely, the second class of methods characterizes a more expressive model, with the caveat of not being able to reason about the hidden states explicitly. We present methods that can cope with dynamical systems with discrete observations (HMMs (§ 2.3.1), PSRs (§ 2.5.4)) and continuous observations (HSE-HMMs (§ 2.5.5)); later we describe work on controlled processes (controlled PSRs (§ 2.5.6)).

[Figure 2.3: Moment based learning. (a) Latent variable learning, with explicit state mapping, first estimates model parameters M, Z from moments W, then forms observable operators A_o. (b) Spectral learning, where the mapping to the state variables remains unknown, directly finds observable operators A_o from moments W.]

If we want to do sequence labeling or discover explicit parameters of the model, such as transitions and emissions in a hidden Markov model, the former family of estimators is more suitable. However, if we want to predict sequences of observations conditioned on past observations and/or actions, or even generate observations from the model (language modeling), the latter group of methods is probably a better fit. In this thesis, we attend to both types of models.

2.3 Latent Variable Models

Latent space models are a general tool in machine learning for modeling data generated from i.i.d., sequential or more structured samples (Ghahramani et al., 1999; Jordan et al., 1999; Blei et al., 2003; Quattoni et al., 2005; Haghighi et al., 2006). Learning these models is hard in most cases, since the hidden/latent states are not directly observed. Instead, they need to be estimated from correlated observations. We consider in particular the case of sequential models. Let G be a graphical model with observed variables o = {o_1, ..., o_L}, where o_i ∈ Σ is a discrete observation in a set of dimension |Σ| = d; and let z = {z_1, ..., z_M} ∈ [K]^M denote its discrete hidden variables, with each z_j ∈ [K], ∀j ∈ [M]. For every graphical model, its parameters can be estimated via the MoM, where we can exploit the underlying structure of the model. We decompose the estimation into two parts, p(O, Z) = ∏_{S ∈ C} p(O_v | Z_S) p(Z_S), for any set of nodes S forming a clique in the graph. First, we estimate the leaves of the graph, corresponding to observed variables conditioned on some conditionally independent hidden variables. We denote these conditional moments or conditionals:

M_v = p(O_v | Z_S) ∈ R^{d×K}.    (2.3)

Second, we estimate the connection among the subset of hidden variables; we denote these marginals:

p(Z_S) ∈ R^K.    (2.4)

Correct estimation of conditional moments usually requires certain assumptions: either independent views (in particular, for HMMs three independent views O_v, v ∈ [3], need to exist for each hidden variable, to uniquely identify the model), or full rank for each conditional moment matrix M_v (Kruskal, 1977). The full-rank assumption of the conditional moments ensures that all hidden variables are linearly independent, guaranteeing the recovery of conditional moments. On the other hand, the marginal distribution of the hidden variables p(Z_S) can be recovered for those hidden variables S ∈ Z using different approaches, which we will present in Section 2.4.5.

2.3.1 Hidden Markov Models (HMMs)

[Figure 2.4: Hidden Markov Model: hidden states z_1, ..., z_{L−1} (top, with initial distribution π, transitions T and stopping distribution f*) emit observations o_1, ..., o_{L−1} (bottom, via M).]

HMMs are one of the most fundamental and widely used statistical tools for modeling discrete time series (Rabiner, 1989). These models are characterized by a Markov chain of hidden states z_1, z_2, ..., z_L ∈ Z over |Z| = K possible states [K] (components), with initial state distribution given by π = p(Z_1) ∈ ∆^{K−1}. Given the state at time n, Z_n = z_j ∈ [K], the observations O_n = o_i ∈ [d] are generated independently from other states and observations, as shown in Figure 2.4. We denote the conditional mean distribution of observations as µ_j = p(O_n | Z_n = z_j) ∈ ∆^{d−1}. We write the transition distribution between hidden states as T ∈ R^{K×K}, where T_{j,m} = p(Z_n = z_j | Z_{n−1} = z_m). We further define a stopping distribution f* ∈ R^K with f*_j = p(stop | Z_{L−1} = j), ∀j ∈ [K], such that 1_K^⊤ T + f*^⊤ = 1_K^⊤. We consider the matrix of all conditional probabilities— also known as the emission or observation matrix, M = [µ_1 | µ_2 | ... | µ_K]. HMMs are characterized by two conditional independence assumptions. The first one is the Markov assumption, which states that the current state is independent of all the previous history of events conditioned on just the previous state: p(Z_n | Z_{n−1}, Z_{n−2}, ..., Z_1) = p(Z_n | Z_{n−1}). The second conditional independence assumption defines the generative property of the model: observations O_n are generated conditionally independently given the current state Z_n. Given these assumptions, T, π, f* and M fully characterize the HMM. Figure 2.4 shows the corresponding graphical model of an HMM.
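As an illustration of this generative process, the following sketch samples a fixed-length observation sequence from a toy HMM (the parameters are made up, and the stopping distribution f* is folded in by fixing the length):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 2, 3
pi = np.array([0.6, 0.4])              # initial state distribution
T = np.array([[0.7, 0.2],              # T[j, m] = p(Z_n = j | Z_{n-1} = m)
              [0.3, 0.8]])
M = np.array([[0.5, 0.1],              # M[o, j] = p(O_n = o | Z_n = j)
              [0.4, 0.1],
              [0.1, 0.8]])

def sample_hmm(length):
    z = rng.choice(K, p=pi)
    obs = []
    for _ in range(length):
        obs.append(rng.choice(d, p=M[:, z]))  # emit given the current state
        z = rng.choice(K, p=T[:, z])          # transition to the next state
    return obs

print(sample_hmm(10))
```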

2.4 Latent Variable learning

In many classes of models, latent states can be estimated from low order moments, typically second or third order moments. This contrasts with earlier moment algorithms that relied on the estimation of high-order moments, which are usually hard to estimate accurately due to their large variance. These decompositions typically come with simple and efficient learning algorithms, some of which we discuss in this section.

2.4.1 Maximum Likelihood

The most common solution to learning latent variable models is based on the maximum likelihood principle or Bayesian inference, in contrast to moment-based approaches. Here model parameters θ are obtained by maximizing the likelihood of the observed data o ∈ Σ given the chosen parameters, also known as maximum likelihood estimation (MLE) (Fisher, 1912), with respect to a probability distribution q(z) over the latent variables:

L(θ) = log p(o | θ)                                                              (2.5)
     = ∑_z q(z) log [ p(o | z, θ) p(z | θ) / p(z | o, θ) ]
     = ∑_z q(z) log [ p(o | z, θ) p(z | θ) / q(z) ] + ∑_z q(z) log [ q(z) / p(z | o, θ) ]
     = F(q, θ) + KL( q(z) ‖ p(z | o, θ) )
     ≥ F(q, θ)

where z ∈ Z denotes the hidden variables of the model and the second equality follows from Bayes' rule, p(o | θ) = p(o | z, θ) p(z | θ) / p(z | o, θ), for any z. F defines a lower bound on the likelihood function and is referred to as the free energy; the second term is the Kullback-Leibler divergence between the latent distribution q(z) and the model distribution p(z | o, θ), which is always non-negative, as follows from Jensen's inequality applied over q(z) (Jensen, 1906). This ensures that maximizing the lower bound is enough to guarantee an increase in the exact log likelihood. The bound can be further decomposed into two terms, the energy and the entropy H(q) = −∫ q(z) log q(z) dz, the latter independent of θ:

F(q, θ) = ∑_z q(z) log p(o, z | θ) + H(q)                                        (2.6)

MLE is statistically consistent but leads to intractable optimization problems due to the expectation over all possible latent distributions (Redner et al., 1984; Mossel et al., 2006). Local heuristics such as the expectation-maximization (EM) algorithm (Dempster et al., 1977) have proven very successful in many fields. They have been applied to a vast range of latent variable models, such as Gaussian mixture models (GMMs) (Titterington et al., 1985), topic models such as latent Dirichlet allocation (LDA) (Blei et al., 2003), hidden Markov models (HMMs) (Rabiner, 1989) and probabilistic context free grammars (PCFGs) (Matsuzaki et al., 2005). The EM algorithm consists of two alternating maximization steps w.r.t. q and θ — also regarded as coordinate ascent on the free energy:

• E-step: Fix θ^t and solve for the distribution over latent variables, q^{t+1} = arg max_{q ∈ P} F(q, θ^t). Since the log-likelihood is independent of q(z), maximizing the lower bound is equivalent to minimizing the KL divergence KL(q(z) ‖ p(z | o, θ^t)), which yields q^{t+1} = p(z | o, θ^t).

• M-step: Fix q^{t+1} and solve for the model parameters, θ^{t+1} = arg max_θ F(q^{t+1}, θ). The second term of Eq. 2.6 is independent of θ, so it is enough to maximize the energy term.

Specifically for HMMs (with parameters $\pi = p(z_1)$, $T = p(z_n \mid z_{n-1})$, $M = p(o_n \mid z_n)$), the EM algorithm (Baum et al., 1970; Welch, 2003) estimates state and transition marginals in the E-step, using the forward-backward algorithm in Eqs. 2.7-2.8.

Forward variables are updated as $\alpha_{n+1} = T\,\mathrm{diag}(M_{o_{n+1}})\,\alpha_n$, starting from $\alpha_1 = \mathrm{diag}(M_{o_1})\,\pi$; backward variables are updated as $\beta_n^\top = \beta_{n+1}^\top\, T\,\mathrm{diag}(M_{o_{n+1}})$, where $\beta_\ell = \mathbf{1}$. Consider the parameters of the model $\theta = (T, M)$ in Section 2.3.1, and let $o_{1:\ell}$ refer to the full sequence of observations of size $\ell$. We define the forward variables to be $\alpha_n^j = p(o_{1:n}, z_n = z_j \mid \theta)$, and we refer to the backward variables as $\beta_n^j = p(o_{n+1:\ell} \mid z_n = z_j, \theta)$, such that $\beta_n, \alpha_n \in \mathbb{R}^K$. The E-step marginals are then

$p(Z_n = z_j \mid o_{1:\ell}, \theta^t) = \dfrac{\alpha_n^j\, \beta_n^j}{\alpha_n^\top \beta_n}$  (2.7)

$p(Z_n = z_j, Z_{n+1} = z_i \mid o_{1:\ell}, \theta^t) = \dfrac{\beta_{n+1}^i\, T_{i,j}\, M_{o_{n+1},i}\, \alpha_n^j}{\beta_{n+1}^\top\, T\, \mathrm{diag}(M_{o_{n+1}})\, \alpha_n}$  (2.8)

In the M-step we update the HMM parameters using the marginals:

$T^{t+1}_{i,j} = \dfrac{\sum_{n=1}^{\ell-1} p(Z_n = z_j, Z_{n+1} = z_i \mid o_{1:\ell}, \theta^t)}{\sum_{n=1}^{\ell-1} p(Z_n = z_j \mid o_{1:\ell}, \theta^t)}$  (2.9)

$M^{t+1}_{m,j} = \dfrac{\sum_{n : O_n = m} p(Z_n = z_j \mid o_{1:\ell}, \theta^t)}{\sum_{n=1}^{\ell} p(Z_n = z_j \mid o_{1:\ell}, \theta^t)}$  (2.10)

This process is repeated until a local maximum of the free energy is reached. This is equivalent to maximizing the complete likelihood of the observations, since the entropy does not depend on the parameters $\theta$. This algorithm presents a tractable solution to maximum likelihood estimation but provides only local convergence guarantees, being sensitive to initialization. Furthermore, this method is known to suffer from slow convergence, and typically requires several inference passes over the data, which can be prohibitively expensive on large-scale datasets.
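To make the E- and M-steps above concrete, the following is a minimal NumPy sketch of one Baum-Welch iteration for a discrete HMM (an illustrative helper, not code from the referenced works); it uses the common convention in which the emission weight at step $n$ is applied after the transition, omits the stopping distribution, and skips the rescaling needed for long sequences.

import numpy as np

def em_step(obs, T, M, pi):
    # One Baum-Welch iteration. obs: observation indices (length l);
    # T[i, j] = p(z_{n+1} = i | z_n = j); M[o, j] = p(o | z_j); pi = p(z_1).
    obs = np.asarray(obs)
    l, K = len(obs), T.shape[0]
    alpha = np.zeros((l, K))            # forward: p(o_{1:n}, z_n)
    beta = np.ones((l, K))              # backward: p(o_{n+1:l} | z_n)
    alpha[0] = M[obs[0]] * pi
    for n in range(1, l):
        alpha[n] = M[obs[n]] * (T @ alpha[n - 1])
    for n in range(l - 2, -1, -1):
        beta[n] = T.T @ (M[obs[n + 1]] * beta[n + 1])
    gamma = alpha * beta                # state marginals, Eq. 2.7
    gamma /= gamma.sum(axis=1, keepdims=True)
    xi = np.zeros((l - 1, K, K))        # pairwise marginals, Eq. 2.8
    for n in range(l - 1):
        x = T * np.outer(M[obs[n + 1]] * beta[n + 1], alpha[n])
        xi[n] = x / x.sum()
    # M-step: normalized expected counts, Eqs. 2.9-2.10.
    T_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)
    M_new = np.zeros_like(M)
    for o in range(M.shape[0]):
        M_new[o] = gamma[obs == o].sum(axis=0)
    M_new /= M_new.sum(axis=0, keepdims=True)
    return T_new, M_new, gamma[0]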

2.4.2 Moment-based Learning

Moment-based learning offers an alternative approach to learning latent variable models, with better convergence guarantees. These methods yield (asymptotically) statistically optimal solutions, and can be computed efficiently. Consider the multi-view model in Section 2.3, where we represent observations as vectors $o_i \in \mathbb{R}^V$ for every symbol $o_i \in \Sigma$. We assume a generative perspective where observations are generated from conditionally independent hidden variables $Z$ with values in $[K]$.² We define second- and third-order moments as uncentered covariances $W_2 \in \mathbb{R}^{V \times V}$ and $W_3 \in \mathbb{R}^{V \times V \times V}$, respectively.

² In topic models the observations are generated from the same hidden variable $Z$; for arbitrary models we simply require the different views to be conditionally independent given the variable $Z$.

When the observation representation is encoded as indicator vectors, i.e., $o_i = e_i \in \mathbb{R}^V$ is zero everywhere except in the $i$-th coordinate, the moment matrices and tensors may be interpreted as probabilities; otherwise they are expectations:

$W_2 = \mathbb{E}[o_1 \otimes o_2] = \mathbb{E}\big[\mathbb{E}[o_1 \otimes o_2 \mid Z]\big] = \mathbb{E}\big[\mathbb{E}[o_1 \mid Z] \otimes \mathbb{E}[o_2 \mid Z]\big] = \mathbb{E}[Z \otimes Z](M_1, M_2) = Z_2(M_1, M_2)$  (2.11)

where $M_i = \mathbb{E}[o_i \mid Z] \in \mathbb{R}^{V \times K}$ is the conditional moment w.r.t. view $i$, and $Z_2 \in \mathbb{R}^{K \times K}$ represents the marginals. Additionally we define third-order moments as a 3-mode tensor $W_3 \in \mathbb{R}^{V \times V \times V}$, see Figure 2.5:

$W_3 = \mathbb{E}[o_1 \otimes o_2 \otimes o_3] = \mathbb{E}[Z \otimes Z \otimes Z](M_1, M_2, M_3) = Z_3(M_1, M_2, M_3)$  (2.12)

where $Z_3 \in \mathbb{R}^{K \times K \times K}$ is the inner tensor representing the marginals.

Figure 2.5: Moment decomposition. (a) Second-order moment factorization into conditional moments (blue), for two views $M_1 \in \mathbb{R}^{d_1 \times K}$, $M_2 \in \mathbb{R}^{d_2 \times K}$, and marginals (orange) represented by a $K \times K$ matrix $Z_2$. (b) Third-order moment factorization into conditional moments (blue), for three views $M_v \in \mathbb{R}^{d_v \times K}$, $v \in [1,3]$, and marginals (orange) represented by a $K \times K \times K$ tensor $Z_3$. Tensor factorization approaches find a decomposition of third-order moments; anchor learning deals with the factorization of second-order moments.

In this section we describe several approaches to learn the conditional and marginal factors when we know the mapping to the hidden state variables, in the family of latent variable learning. The first is based on tensor factorization, and has been used in a wide range of applications, such as community detection (Anandkumar et al., 2012a), parsing (Cohen et al., 2013a), knowledge base completion (Chang et al., 2014; Singh et al., 2015), topic modeling (Anandkumar et al., 2015), mixture models (Anandkumar et al., 2012b), and graphical models (Chaganty et al., 2014a). For many types of multi-view models it is sufficient to learn from low-order moments, typically third order, hence the need for general tensor decomposition or factorization algorithms, in Section 2.4.3.

The second approach relies on a non-negative factorization of moment matrices. This method requires certain separability conditions on the hidden states, and exploits the inherent structure of the latent model to obtain the different factors. It aims to find sets of observations to serve as surrogate representations for hidden variables. These are known as anchor methods, and have been mostly applied to topic modeling, in Section 2.4.4. Additionally we revisit different forms of obtaining the marginals in Eq. 2.11, using a pseudo-inverse and a likelihood-based approach, in Section 2.4.5.

2.4.3 Tensor Methods

Tensor factorization algorithms have been proposed as an alternative way of learning the mapping to the hidden state variables. This class of methods has the advantage of directly allowing us to estimate the conditionals from moment statistics. There are several tensor decomposition methods that attempt to find a low-rank decomposition of a

tensor. Let $\otimes$ denote the tensor product, and given a $p$th-order tensor³ $T_p = \mathbb{E}[o_1 \otimes \ldots \otimes o_p] \in \mathbb{R}^{n^p}$, we consider its decomposition into rank-one terms, the canonical polyadic decomposition (Hitchcock, 1927):

³ Also known as a $p$-mode tensor.

$T = \sum_{i=1}^{K} \gamma_i\, \mu_1^i \otimes \mu_2^i \otimes \cdots \otimes \mu_p^i$  (2.13)

where $K$ is defined to be the rank of the tensor, and $\mu_j^i = \mathbb{E}[o_j \mid z_i] \in \mathbb{R}^n$, $\forall j \in [p], i \in [K]$, are its factors. When $\mu_1^i = \ldots = \mu_p^i = \mu^i$ for all $i$, the tensor is said to be symmetric, and if $\{\mu_j^i\}_{i \in [K]}$, $\forall j \in [p]$, form an orthogonal set then we say the factorization is orthogonal. The rank of the tensor is defined as the smallest non-negative integer $K$ such that Eq. 2.13 holds. The rank of a 2nd-order tensor reduces to the usual definition of matrix rank, and its decomposition is equivalent to the SVD. However, the concept of tensor rank is not as well behaved as that of matrix rank; for instance, the rank of a symmetric tensor is not guaranteed to be finite (Comon et al., 2008). Tensor decomposition aims to find the conditional moments by decomposing the tensor into its factors in Eq. 2.13.

In the MoM, we build an empirical tensor in Eq. 2.14 from consecutive samples of the three-view model in Figure 2.5b, by the tensor product of observations of the three views of the model $\{o_1, o_2, o_3\}$, $o_i \in \mathbb{R}^V$, $i \in [3]$:⁴

$\hat{T} = \dfrac{1}{D} \sum_{t=1}^{D} o_1^t \otimes o_2^t \otimes o_3^t$  (2.14)

⁴ When observation vectors are one-hot encodings the tensor entries correspond to probabilities, but in general, for continuous representations, it represents an expectation.

The empirical tensor can be described in terms of additive noise $W \in \mathbb{R}^{V \times V \times V}$:

$\hat{T} = \sum_{i=1}^{K} \gamma_i\, \mu_1^i \otimes \mu_2^i \otimes \cdots \otimes \mu_p^i + \epsilon W$.  (2.15)

In the infinite data limit, $D \to \infty$, $W \to 0$ and the tensor factorization approaches the true CPD factorization of Eq. 2.13, resulting in a consistent moment-based estimation. The eigenfactors correspond to columns of the conditional moments $\mu_j^i = [M_j]_{\cdot,i} = \mathbb{E}[o_j \mid z_i]$.

In an HMM, for example, the three views correspond to: the past ($o_1$), current observations ($o_2$), and future observations ($o_3$), see Figure 2.4. When the observations in the present are indicator representations this yields the emission probabilities directly; extensions may be considered for arbitrary features of observations. The observation matrix corresponds to the conditional moment w.r.t. the present view, $O = M_2 = p(o \mid z) = \mu_2$.

Work of Anandkumar et al. (2012b) provides a robust tensor factorization algorithm based on a tensor power method (TPM), which provides global convergence guarantees for symmetric tensors with orthogonal factorization.

First, it reduces the symmetric tensor into an orthogonally decomposable form via a whitening step, and second, it performs a power method to find its eigenvector/eigenvalue pairs. The framework used in this approach is very restrictive, requiring further procedures to transform moments into a symmetric form. In most methods, the model is inherently non-symmetric. HMMs, for instance, are comprised of three independent views: past ($o_1$), future ($o_3$) and present ($o_2$). Performing symmetrizations in these cases may lead to information losses, specifically because they are done through pseudo-inverses of moment matrices (Souloumiac, 2009). Another drawback of this approach is that, if the moments are not retrieved from the true model, the recovery process is not robust. Additionally, this method suffers from the non-negativity problem, where parameters need to be converted back into the probability simplex.

More recent work of Anandkumar et al. (2014) describes a non-orthogonal tensor decomposition relying on an alternating least squares (ALS) variant. This is a generalization to non-symmetric tensors. Finding eigenvectors in this setting is a non-convex problem. This work provides an alternating power iteration that deflates each mode of the tensor asymmetrically, and requires an extra clustering step at the end. Additional work of Shalit et al. (2014) explores an algorithm based on coordinate descent for orthogonal tensor factorization, which involves Givens rotations and obtains better empirical results. Simultaneous diagonalization by Lathauwer (2006) is applicable to general tensor factorizations, but requires solving an $\mathcal{O}(V^4)$ linear system. Kuleshov et al. (2015) provide an additional tensor factorization approach that works in the non-orthogonal setting with arbitrary incoherence, but requires non-singular factors. Alternative tensor factorizations exist, such as the Tucker decomposition, which decomposes a tensor into a set of matrices and a core low-dimensional symmetric tensor (Anandkumar et al., 2015).
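As an illustration of the TPM just described, the following is a minimal sketch of the power iteration with deflation for a symmetric, orthogonally decomposable tensor (the whitening step is assumed to have been applied already, and the random restarts used in practice are omitted); the function name is our own.

import numpy as np

def tensor_power_method(T, K, n_iters=100):
    # Recover eigenvector/eigenvalue pairs of a symmetric, orthogonally
    # decomposable tensor T in R^{n x n x n} by power iteration + deflation.
    n = T.shape[0]
    factors, weights = [], []
    for _ in range(K):
        v = np.random.randn(n)
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            # Tensor-vector contraction T(I, v, v).
            v_new = np.einsum('ijk,j,k->i', T, v, v)
            v_new /= np.linalg.norm(v_new)
            v = v_new
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)   # eigenvalue
        factors.append(v)
        weights.append(lam)
        # Deflate: remove the recovered rank-one component.
        T = T - lam * np.einsum('i,j,k->ijk', v, v, v)
    return np.array(weights), np.array(factors)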

2.4.4 Anchor Learning

In this section, we describe an alternative approach for learning hidden variable models using moment techniques. The development of these methods originated in non-negative matrix factorization (NMF) algorithms for probabilistic topic modeling.

Consider a collection of $D$ documents, where each document $n$ may be classified with a single topic $Z_n = z_j$ among a total of $K$ distinct topics. Let words be discrete observations $O_n = o_i$, in a vocabulary of $V$ words, and assume $N$ words are drawn independently for each topic according to a multinomial distribution. Let $[\mu_j]_i = p(O_n = o_i \mid Z_n = z_j)$ denote the word-topic distribution, and let the matrix $M = [\mu_1 \mid \mu_2 \mid \cdots \mid \mu_K] \in \mathbb{R}^{V \times K}$ denote the matrix of conditional probabilities. Observations in this graphical model are generated as follows:

• a document's topic is drawn from the multinomial distribution defining the topic prior distribution $w_j = p(Z_n = z_j)$, $\forall z_j \in [K]$, where $w_j$ is an element of the probability vector $w = (w_1, w_2, \ldots, w_K) \in \Delta^{K-1}$;

• given a topic $Z_n$, the document's $N$ words are drawn independently according to the word-topic distribution $\mu_j \in \Delta^{V-1}$. We assume $N \geq 3$ words per document.

This graphical model can be represented by the structure depicted in Figure 2.6. To correctly estimate the conditional distributions we require that each topic has non-zero probability, $w_j > 0$, $\forall j \in [K]$, and that $M$ has rank $K$, which ensures that any topic's word distribution is linearly independent of the other topics' word distributions. Specifically, each document is assumed to be represented as a convex combination of the topic vectors $\mu_j$, $\forall j \in [K]$ (Papadimitriou et al., 1998; Hofmann, 1999). This combination may arise from different distributions: multinomial for topic models, or Dirichlet in models like LDA (Blei et al., 2003).

Figure 2.6: Single topic model in plate notation. Topic $Z_j$ generates $N$ observations $O_i$, each with probability $\mu_j$.

We consider the moment matrix $W \in \mathbb{R}^{V \times D}$ to represent the word frequencies of each document $C_n = c_n \in [D]$, as column vectors in $W$ (bag-of-words representation). We represent words as indicator vectors in the vocabulary: $o_i = e_i \in \mathbb{R}^V$ is zero everywhere except in the $i$-th coordinate, for word $i \in [V]$. Using this representation for words and documents, the moment matrix encodes conditional probabilities $W_{i,n} = \mathbb{E}[o_i \mid c_n] = p(O = i \mid c_n)$. Given this inherent structure of the problem, we may decompose the moment matrix into a factorization of two non-negative matrices, its NMF, see Figure 2.7.

$W = MS$,  (2.16)

Figure 2.7: Non-negative matrix factorization of the document matrix $W \in \mathbb{R}^{V \times D}$ into conditional moments $M \in \mathbb{R}^{V \times K}$ (the word-topic distribution) and the topic-document distribution $S \in \mathbb{R}^{K \times D}$.

where $M = [\mu_1, \ldots, \mu_K] \in \mathbb{R}^{V \times K}$ corresponds to a matrix of mean word-per-topic vectors, or conditionals in Eq. 2.3, with each column $\mu_j = p[O \mid Z = z_j] \in \Delta^{V-1}$, $\forall z_j \in [K]$, and $S_{j,n} = p(Z = z_j \mid c_n)$ is the matrix of topic-per-document distributions, $S \in \mathbb{R}^{K \times D}$. The latter is specified by an unknown distribution, for instance a Dirichlet distribution for LDA models, or a logistic normal distribution for correlated topic models.

Alternative approaches based on spectral decompositions are commonly used to estimate $M$, such as latent semantic indexing (LSI) (Deerwester et al., 1990a), see Section 2.4. This approach attempts to find a low-dimensional projection of maximum variance onto a subspace of dimension $K$, the number of topics. This solution is given in terms of the truncated singular value decomposition $\hat{W}_K = \mathrm{SVD}(W) = \sum_{j=1}^{K} \mu_j \sigma_j v_j^\top$, its best rank-$K$ approximation. The vectors $\mu_j$, $\forall j \in [K]$, form an orthonormal basis, where each vector corresponds to a single topic. Since this basis is orthogonal, we only consider a single topic per document, i.e., pure documents (Papadimitriou et al., 1998). These spectral methods decompose the matrix into positive and negative values, since the factors $\mu_j, v_j$ correspond to singular vectors, which can be a disadvantage when we require a probabilistic interpretation of the factorization. There exists additional work on non-negative matrix factorization that uses the singular vectors of $W$ to recover the span of the column space of $M$, and uses it only as a means to evaluate document similarity. More recent work of Arora et al. (2013) recovers the non-negative factors via a distinct approach, requiring instead a separability condition between topics while always maintaining non-negative factors. This approach allows mixtures of topics per document and learns the non-negative factors of $W$ in $\mathcal{O}(\log V\, K^6)$. We describe it in more detail below.

The objective in topic model learning is to estimate the matrix $M$ in Eq. 2.16 by looking at co-occurrences of events, or moments of second order: $\bar{Q} = \mathbb{E}[W W^\top] = p(O_1, O_2) \in \mathbb{R}^{V \times V}$, which we denote the word co-occurrence matrix, or moment matrix. The true matrix, i.e., the co-occurrence under infinitely many samples, can be decomposed into two factors according to its conditional independence structure $O_1 \perp O_2 \mid Z_1$: the conditionals $M_v = p(O_v \mid Z_v)$, $v \in \{1,2\}$, and the marginals $Z = p(Z_1, Z_2) \in \mathbb{R}^{K \times K}$ from $S = p(Z \mid C) \in \mathbb{R}^{K \times D}$:

$p(O_1, O_2) = \bar{Q} = \mathbb{E}[W W^\top] = M_1\, \mathbb{E}[S S^\top]\, M_2^\top = M Z M^\top = \sum_{z_k=1}^{K} p(O_1 \mid Z_1 = z_k)\, p(Z_1, Z_2)\, p(O_2 \mid Z_2 = z_k)$  (2.17)

See Figure 2.7. In practice, we aim to estimate conditionals and marginals $(M, Z)$ from empirical counts, i.e., finite samples, which we denote as $\hat{Q}$. The most common approach to solving this type of problem resorts to likelihood estimation, finding the matrix $M$ that generates $W = MS$ with highest probability (Blei et al., 2003). MLE for topic modeling is NP-hard (Vavasis, 2009), and approximate solutions suffer from the problems discussed in Section 2.4.1. Sanjeev Arora (2012) proposed an algorithm that learns both $M$ and $Z$ from unsupervised data without any assumption on the nature of the distribution that generates topics for each document. This approach considers instead the decomposition of the conditional moment $Q_{m,n} = p(O_2 = o_m \mid O_1 = o_n)$, where

$Q \in \mathbb{R}^{V \times V}$ is obtained by column normalization of $\bar{Q}$:

$Q = M\Gamma, \qquad Q_{m,n} = \sum_{z_j=1}^{K} p(O_2 = m \mid Z_1 = z_j)\, p(Z_1 = z_j \mid O_1 = n)$,  (2.18)

s.t. $\mathbf{1}^\top Q = \mathbf{1}^\top$,

and $\Gamma = p(Z_1 \mid O_1) \in \mathbb{R}^{K \times V}$ is a stochastic matrix of coefficients. In this work we assume the word-topic matrix $M$ is $\alpha$-separable (Donoho et al., 2003). A word-topic matrix is $\alpha$-separable if for each topic $z_s$ there exists a word $a_s \in \Sigma$ such that $p(a_s \mid z_s) \geq \alpha$ and $p(a_s \mid z_j) = 0$ for $j \neq s$, $\alpha > 0$. We then say this word is an anchor word for topic $z_s$. Anchor words are unambiguous indicators of the presence of their topic in the document, since no other topic generates them. Hence we may use them as surrogate topic representations: for a given anchor word $a_s \in \Sigma$ for state/topic $z_s$, we may write $p(Z_1 = z_s \mid O_1 = a_s) = 1$, such that the conditional moment factorization becomes:

$Q_{m,a_s} = \sum_{z_j=1}^{K} p(O_2 = m \mid Z_1 = z_j)\, p(Z_1 = z_j \mid O_1 = a_s) = p(O_2 = m \mid Z_1 = z_s)\, p(Z_1 = z_s \mid O_1 = a_s) = p(O_2 = m \mid Z_1 = z_s)$  (2.19)

The rows of the conditional moment matrix may be regarded as convex combinations of the columns corresponding to anchor words:

$Q_{m,n} = \sum_{j=1}^{K} p(O_2 = m \mid Z_1 = z_j)\, p(Z_1 = z_j \mid O_1 = n) = \sum_{j=1}^{K} Q_{m,a_j}\, \Gamma_{j,n} = R_m \gamma_n$  (2.20)

where $R_m = [Q_{m,a_1}, \ldots, Q_{m,a_K}]$, with $R \in \mathbb{R}^{V \times K}$ collecting the columns of anchor words for each topic, and $\gamma_n = p(Z_1 \mid O_1 = n) \in \mathbb{R}^K$ is the convex coefficient vector associated with word $n$, corresponding to a column of $\Gamma$. We may turn Eq. 2.20 into an algorithm by minimizing a loss between each row $Q_m$ and the factorization $R_m \Gamma$, such that $\mathbf{1}^\top \gamma_n = 1$. Algorithm 1 shows the algorithm proposed by Sanjeev Arora (2012) to recover $M$ and $Z$, using a KL divergence loss $\mathrm{loss}(Q_m, R_m \Gamma) = \mathrm{KL}(Q_m \| R_m \Gamma)$ or a quadratic loss $\mathrm{loss}(Q_m, R_m \Gamma) = \|Q_m - R_m \Gamma\|_2^2$. The conditionals are estimated using Bayes' rule, where $q_m = p(O_1 = m) = \sum_n p(O_1 = m, O_2 = n)$:

$M_{m,j} = \dfrac{p(Z_1 = z_j \mid O_1 = m)\, p(O_1 = m)}{\sum_{n=1}^{V} p(Z_1 = z_j \mid O_1 = n)\, p(O_1 = n)} = \dfrac{\Gamma_{j,m}\, q_m}{\Gamma_{j,\cdot}\, q}$.  (2.21)

Algorithm 1: NMF: Non-Negative Matrix Factorization: find conditionals $M$ and marginals $Z$.
Input: Anchor set $\mathcal{A}$, moment matrix $\bar{Q}$
1: Column-normalize $\bar{Q}$: $Q_{\cdot,n} = \bar{Q}_{\cdot,n} / (\mathbf{1}^\top \bar{Q}_{\cdot,n})$.
2: for each word $m = 1, \ldots, V$ do
3:   Find the coefficients $\gamma_m^* = \arg\min_{\gamma_m} \mathrm{loss}(Q_m, R_m \Gamma)$,
4:   s.t. $\mathbf{1}^\top \gamma_m = 1$ and $\Gamma_{m,j} \geq 0$, $\forall j \in [K]$.
5: Estimate conditionals $M = \mathrm{diag}(q)\, \Gamma^\top$ and normalize columns to 1.
6: Estimate marginals $Z = M^\dagger Q M^{\dagger\top}$ via §2.4.5.
7: Return: $M$, $Z$.
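A sketch of the recovery steps of Algorithm 1 under the quadratic loss follows; it assumes the joint co-occurrence matrix is row-indexed by the conditioning word, so that each row of the row-normalized matrix is a convex combination of the anchor rows (the transpose of the column convention in Eq. 2.18), and it is an illustrative reconstruction, not the authors' implementation.

import numpy as np
from scipy.optimize import nnls

def recover_topics(Qbar, anchors):
    # Qbar[i, j] = empirical p(O_1 = i, O_2 = j); anchors: K word indices.
    q = Qbar.sum(axis=1)                        # q_i = p(O_1 = i)
    Q = Qbar / Qbar.sum(axis=1, keepdims=True)  # Q[i] = p(O_2 | O_1 = i)
    R = Q[anchors]                              # anchor rows, K x V
    K, V = len(anchors), Q.shape[0]
    Gamma = np.zeros((K, V))
    for i in range(V):
        g, _ = nnls(R.T, Q[i])                  # Q[i] ~= g^T R, g >= 0 (Eq. 2.20)
        Gamma[:, i] = g / max(g.sum(), 1e-12)   # renormalize onto the simplex
    M = (Gamma * q).T                           # Bayes' rule, Eq. 2.21
    M /= M.sum(axis=0, keepdims=True)           # columns of M sum to one
    return M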

Algorithm 2: Find Anchors
Input: $V$ points in $V$ dimensions, $\mathcal{D} = \{d_1, \ldots, d_V\}$.
1: Project each point $d_i \in \mathcal{D}$ onto a $4 \log V / \epsilon^2$-dimensional subspace.
2: Add the point $d_j \in \mathcal{D}$ furthest from the origin: $\mathcal{A} \leftarrow \{d_j\}$.
3: for each vertex $j = 1, \ldots, K-1$ do
4:   Find the point $d_i \in \mathcal{D}$ furthest from $\mathrm{span}(\mathcal{A})$,
5:   Add the point to the anchor set: $\mathcal{A} \leftarrow \mathcal{A} \cup \{d_i\}$.
6: $\mathcal{A} = \{v_1, \ldots, v_K\}$
7: for $j = 1, \ldots, K$ do
8:   Find the point $d_i \in \mathcal{D}$ furthest from $\mathrm{span}(\mathcal{A} \setminus \{v_j\})$,
9:   Replace the vertex in the anchor set: $\mathcal{A} \leftarrow \mathcal{A} \setminus \{v_j\} \cup \{d_i\}$.
10: Return: $\mathcal{A}$

The marginals $Z$, in turn, may be recovered via a pseudo-inverse approach, described in Section 2.4.5.

Algorithm 2 describes an unsupervised procedure that extracts anchor words from unlabeled samples. The core idea of this method is that anchor words are extreme points of the vocabulary, in a geometric sense. In the infinite data limit, the convex hull of the columns of $Q$ corresponds to the simplex of probabilities, whose vertices are associated with anchor words. In practice, however, we only have empirical estimates of the simplex, so we observe noisy perturbations of these vertices. The $\alpha$-separability condition ensures a minimum distance between the vertices. The procedure iteratively finds the point furthest from the subspace spanned by the current set of anchors,⁵ and terminates when it has found the $K$ most separated words. This approach is robust to noise, and behaves better as the distance of each vertex to the affine hull of the remaining vertices increases.

⁵ The distance refers to the norm of the projection of each point onto the orthogonal complement of $\mathrm{span}(\mathcal{A})$.

Finally, after estimating the set of anchor words, we can recover the convex coefficients for each symbol in the alphabet, $\gamma_n = p(Z_1 \mid O_1 = n)$, $\forall n \in \Sigma$, via Eq. 2.20. The conditional moments are then determined via Bayes' rule, in Algorithm 1.
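A minimal sketch of the greedy anchor-selection loop of Algorithm 2 (without the random projection of step 1 and the vertex-replacement pass of steps 7-9), using Gram-Schmidt orthogonalization to measure distances to the span of the current anchors; this is an illustrative reconstruction, not the original implementation.

import numpy as np

def find_anchors(Q, K):
    # Treat the rows of the row-normalized co-occurrence matrix Q as points;
    # greedily pick K rows furthest from the span of the anchors chosen so far.
    points = Q / Q.sum(axis=1, keepdims=True)
    anchors = [int(np.argmax(np.linalg.norm(points, axis=1)))]
    basis = []
    for _ in range(K - 1):
        # Extend an orthonormal basis of span(anchors) with the last anchor.
        v = points[anchors[-1]].copy()
        for b in basis:
            v -= (v @ b) * b
        basis.append(v / np.linalg.norm(v))
        # Distance of each point to span(anchors) = norm of its residual.
        residual = points.copy()
        for b in basis:
            residual -= np.outer(residual @ b, b)
        anchors.append(int(np.argmax(np.linalg.norm(residual, axis=1))))
    return anchors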

2.4.5 Recovering Marginals

Following the estimation of the conditional moments $M_v = \mathbb{E}[o_v \mid Z] \in \mathbb{R}^{V \times K}$ for each conditionally independent view, $\forall v \in [1,2]$, we are left with estimating the state marginals $Z$, in Figure 2.3a. The conditionals may be estimated via anchor learning for topic models and mixture models (Section 2.4.4), or via tensor factorization for general graphical models (Section 2.4.3). For most graphical models, and specifically for sequential models, it is enough to consider hidden marginals of second order, $Z_2 = \mathbb{E}[Z_1 \otimes Z_2]$, to correctly estimate the model—transition probabilities for HMMs. Consider the moment decomposition in Eq. 2.11 of $W_2$ in terms of the hidden marginals $Z_2$, where

$W_2 = Z_2(M_1, M_2) = M_1 Z_2 M_2^\top \in \mathbb{R}^{V \times V}$.

We can solve for $Z_2$ by matrix inversion, where $M_v^\dagger \in \mathbb{R}^{K \times V}$ denotes the Moore-Penrose pseudo-inverse, $\forall v \in [1,2]$:

$Z_2 = W_2(M_1^\dagger, M_2^\dagger) = M_1^\dagger\, W_2\, M_2^{\dagger\top}$  (2.22)

and analogously for third-order moments:

$Z_3 = W_3(M_1^\dagger, M_2^\dagger, M_3^\dagger)$  (2.23)

Eq. 2.22 recovers the hidden marginals of a set of hidden variables that constitute the joint probability of consecutive state variables in the latent variable model. In many models, further computations are required to retrieve the model parameters. In HMMs we recover the transition matrix by an additional normalization step, $[T]_{k,k'} = [Z_2]_{k,k'} / \sum_j [Z_2]_{k,j} \in \mathbb{R}^{K \times K}$.

Chaganty et al. (2014a) provide an alternative approach that builds upon a likelihood-based method (Lindsay, 1988). Making use of the already estimated conditionals $M_v$ and the observed moments $W_S$, they recover the marginals $Z_S = p(S)$ over a set of nodes $S = \{Z_1, \ldots, Z_m\}$ with certain conditional independence properties. Both the pseudo-inverse approach and the convex composite likelihood retrieve the joint probability over states, or marginals. In some models, such as HMMs, we further need to transform marginals into model parameters, the conditional transition matrix $T = p(Z_1, Z_2) / \sum_j p(Z_1, Z_2 = j)$; generally, in directed models this step comes down to normalizing the probability of each state by that of the parent states in the set $S$.
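The pseudo-inverse recovery of Eq. 2.22, together with the HMM normalization step above, is only a few lines of linear algebra; a sketch, assuming the conditionals $M_1$ and $M_2$ have already been estimated (e.g., by anchor or tensor methods):

import numpy as np

def recover_marginals(W2, M1, M2):
    # Z2 = M1^+ W2 (M2^+)^T, Eq. 2.22.
    Z2 = np.linalg.pinv(M1) @ W2 @ np.linalg.pinv(M2).T
    # For an HMM, normalize each row as in the step above to get T.
    T = Z2 / Z2.sum(axis=1, keepdims=True)
    return Z2, T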

2.5 Spectral learning of sequential systems

Consider the existence of two sample spaces, a state space $\mathcal{Z}$ and an observation space $\Sigma$, and a many-to-one mapping from $\mathcal{Z}$ to $\Sigma$. Observed data $o \in \Sigma$ are realizations from $\Sigma$, while $z$ are realizations from $\mathcal{Z}$, which are not directly observed. The former space consists of observable output quantities of a dynamical system, while the latter models the evolution of the observed process implicitly, via its internal state representation, as shown in Figure 2.8.

In this section, we turn to a distinct line of work that models a dynamical system or a stochastic process without recovering the actual parameters of the model, in particular the conditionals and hidden marginals, see Figure 2.3b. These methods provide alternative forms of predicting and generating sequences of observations by means of observable representations alone (Jaeger, 2000; Balle et al., 2012a; Bailly et al., 2013a). They are, however, less amenable to combination with likelihood-based approaches. Parikh et al. (2012a) provided an approach to estimate these observable operator representations for general graphical models which are bottlenecked using latent junction tree representations.

In this section, we introduce sequential models from the perspective of dynamical system learning (Thon et al., 2015). We define a common representation for sequential models in Section 2.5.1 and provide a description of closely related algebraic formalisms from different communities: observable operator models for stochastic processes in Section 2.5.3, predictive state representations for controlled systems in Section 2.5.4, and weighted finite automata from formal language theory for stochastic languages in Section 2.5.2. We discuss how these different representations are connected, their differences and commonalities.

2.5.1 Sequential models

Observations in a sequential system can be approximated by some finite-dimensional random vector $O = (O_i)_{i \in [d]}$, where $O_i$ is stationary with values in a finite alphabet $\Sigma = \{o_1, \ldots, o_d\}$ of $d$ possible symbols. Let $\Sigma^* = \{*\} \cup \Sigma \cup \Sigma^2 \cup \ldots$ be the Kleene closure of $\Sigma$, i.e., the set of all finite sequences with elements in $\Sigma$, plus the empty sequence $*$. We define a sequence as consecutive realizations of symbols in the alphabet, $o_{1:n} = o_1 o_2 \ldots o_n \in \Sigma^*$. We consider a general description of sequential systems, as defined in Thon et al. (2015).

Characterization Functions
Consider functions $f : \Sigma^* \to \mathbb{R}^K$ that map sequences of observations to the field of real values $\mathbb{R}^K$. These functions can characterize distributions over sequences in $\Sigma^*$, if we consider a mapping to the simplex of probabilities $\Delta^{K-1}$, used to describe probabilistic languages (weighted automata), or simply probabilities of observations in the context of stochastic processes, if we consider a mapping onto the interval $[0,1]$ (which is the case of OOMs and PSRs).

Figure 2.8: Observations (bottom) generated from a hidden process (top); histories $h_n$ (left) and tests $t_n$ (right).

For any given function we define $f_h : \Sigma^* \to \mathbb{R}$ to be a function in a Banach space $\mathcal{F}$ of functions, where the sequence $h$ is a sequence of observations of arbitrary size. We further distinguish between the sequences of already observed events at time $n$, which we call histories, $h \equiv h_n = \{o_n, o_{n-1}, \ldots, o_1\} \in \Sigma^*$, and sequences of future observations, which we call tests, $t \equiv t_n = \{o_{n+\ell}, \ldots, o_{n+2}, o_{n+1}\} \in \Sigma^*$, as depicted in Figure 2.8. The state is considered to be sufficient to model the evolution of the dynamical system and is closely related to the functions $f_h \in \mathcal{F}$, which evaluate sequences of observations. Let the concatenation of two sequences be defined as $ht$; evaluating the concatenated sequence results in $f_h(t) = f(ht)$. The functions $f_h : \Sigma^* \to \mathbb{R}$ fully characterize the evolution of the system. Let the functional Banach space $\mathcal{F}$ be defined by the closure of the linear span of future prediction functions, $\mathcal{F} = \mathrm{span}\{f_h \mid h \in \Sigma^*\}$. We assume this space has finite dimension and we say the system dimension is $\dim(\mathcal{F}) = d \leq \infty$. If for a given function $f : \Sigma^* \to \mathbb{R}$ a finite set of functions spans $\mathcal{F}$:

$\mathcal{F} = \mathrm{span}\{f_{h_i}, \forall i \in N\}$,  (2.24)

then we say this set forms a basis of $\mathcal{F}$. If the number of functions is $N = d = \mathrm{rank}(\mathcal{F})$, then they form a minimal basis. In the planning literature, $\dim(\mathcal{F})$ is also known as the linear dimension of a dynamical system (Singh et al., 2004b). The dimension corresponds to the number of elements in the basis, for finite-dimensional spaces.

Linear Transition Operators
Next, we define linear operators that transform these characterizing functions. These observable operators linearly weight the characterizing functions for each observation, and do not depend on the choice of basis for $\mathcal{F}$. We define a linear operator $A_o : \mathcal{F} \to \mathcal{F}$ that encodes information about the evolution of the system due to an incoming observation $o \in \Sigma$. We denote this operator an observable operator:

$A_o(f_h) = f_{oh} \in \mathcal{F}$, and  (2.25)
$A_h = A_{o_n} \circ A_{o_{n-1}} \circ \ldots \circ A_{o_1}$  (2.26)

where $oh \in \Sigma^*$ represents the concatenation of the sequence $h \in \Sigma^*$ with the consecutive observation $o \in \Sigma$. We can then denote the probability of a sequence in terms of these operators. Consider now a linear

functional $\alpha_\infty : \mathcal{F} \to \mathbb{R}$ that operates on the characterizing functions:

$\alpha_\infty(f_h) = f(h) \quad \forall h \in \Sigma^*$,  (2.27)

in particular, for the empty symbol $*$ and the function $f_*$ generating the empty symbol, we have $\alpha_\infty(f_*) = f(*)$ and $A_h(f_*) = f_h$. We also define a normalization function $f_\Sigma$ such that:

$f_\Sigma(h) = \sum_{o \in \Sigma} f(oh)$.  (2.28)

Sequential Models
We define a sequential model $A$ of dimension $K$ as the tuple $\langle \mathbb{R}^K, \alpha_\infty, \{A_o\}_{o \in \Sigma}, \alpha_0 \rangle$, where $\alpha_0 \in \mathbb{R}^K$ denotes the initial state vector, and $A_o \in \mathbb{R}^{K \times K}$, $\forall o \in \Sigma$, represent state transition matrices. The state of the process at time $n$, $\alpha_n \in \mathbb{R}^K$, is given by:

$\alpha_n = A_h \alpha_0 = A_{o_n} A_{o_{n-1}} \cdots A_{o_1} \alpha_0$.  (2.29)

Evaluating the state of the process can be done via the evaluation functional $\alpha_\infty \in \mathbb{R}^K$, such that the model's function $f_A : \Sigma^* \to \mathbb{R}$ is given by⁶:

$f_A(h) = \alpha_\infty^\top A_h \alpha_0$  (2.30)

⁶ This relation provides the connection between characterizing functions and observable operators. Note that function evaluation lies in $\mathbb{R}$, while observable operators in $\mathbb{R}^{K \times K}$ define how the state changes according to an incoming observation.

We define the rank of the model to be $\mathrm{rank}(f_A) = \mathrm{rank}(A)$.

Minimal Model
Let $f : \Sigma^* \to \mathbb{R}$ be an arbitrary function of rank $d = \mathrm{rank}(f) \leq \infty$; then there exists a $d$-dimensional model $A$ such that $f = f_A$. This refers to the existence of a model whose states are coordinate representations of characterizing functions in some basis of dimension $d$, such that

$f(h) = \alpha_\infty^\top A_h \alpha_0$. When this basis is minimal, we say the model that originated from this basis is also minimal. For this case alone it is possible to establish an equivalence relation between models.

Model Equivalence
Two sequential models $A$ and $B$ are said to be equivalent if they define the same function $f$, such that $f = f_A = f_B$. Consider two models $A = \langle \mathbb{R}^m, \alpha_\infty, \{A_o\}, \alpha_0 \rangle$ and $B = \langle \mathbb{R}^n, b_\infty, \{B_o\}, b_0 \rangle$ related by bijective linear transformations $S \in \mathbb{R}^{n \times m}$ and $S' \in \mathbb{R}^{m \times n}$, such that $SAS' = B = \langle \mathbb{R}^n, S'^\top \alpha_\infty, \{S A_o S'\}, S \alpha_0 \rangle$. If the linear transformation is non-singular then $m = n$, $S' = S^{-1}$ exists, and the two models are equivalent, $f_A = f_B$, and

$B = S A S^{-1}$  (2.31)

is denoted a conjugate equivalent model.⁷ If the sequential model is minimal, then the linear transformation $S$ corresponds to a change of basis of the space $\mathcal{F}$. Therefore, the conditions in Eq. 2.30 allow us to define a valid sequential model, but not uniquely, since we can find a different but equivalent model that describes the same function via conjugate equivalence, in Eq. 2.31. This additional degree of freedom lets us represent sequential models with increased representational power: we can interpret the state as an arbitrary sufficient statistic of future observations. We will make use of this more general interpretation below, when we define spectral algorithms in Section 2.5.6. Since $S$ is a bijective linear map, $S$ must be full column rank, and the terms $S S^{-1}$ cancel out, such that:

⁷ In different communities this property is also known as conjugate invariance, model equivalence under conjugation, or simply equivalent models.

$b_\infty = S^{-\top} \alpha_\infty, \quad b_0 = S \alpha_0, \quad B_o = S A_o S^{-1}$,
$f(*) = b_\infty^\top b_0 = (S^{-\top} \alpha_\infty)^\top S \alpha_0 = \alpha_\infty^\top \alpha_0$,
$f(h) = b_\infty^\top B_h b_0 = (S^{-\top} \alpha_\infty)^\top (S A_{o_n} S^{-1}) \cdots (S A_{o_1} S^{-1})\, S \alpha_0 = \alpha_\infty^\top A_h \alpha_0$.  (2.32)

We can see that sequential models are defined up to a similarity transform $S$, in the case of finite spaces $\mathbb{R}^m$.
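The evaluation in Eq. 2.30 and the conjugate equivalence of Eqs. 2.31-2.32 are easy to verify numerically; a small sketch with arbitrary operators (all names are illustrative):

import numpy as np

def evaluate(alpha_inf, ops, alpha_0, h):
    # f_A(h) = alpha_inf^T A_{o_n} ... A_{o_1} alpha_0, Eq. 2.30.
    state = alpha_0
    for o in h:                          # h is a sequence of symbol indices
        state = ops[o] @ state
    return alpha_inf @ state

# Conjugate equivalence, Eq. 2.31: B = S A S^{-1} defines the same function.
K = 3
rng = np.random.default_rng(0)
alpha_0, alpha_inf = rng.random(K), rng.random(K)
ops = [rng.random((K, K)) for _ in range(2)]
S = rng.random((K, K)) + np.eye(K)       # an (almost surely) invertible map
S_inv = np.linalg.inv(S)
ops_B = [S @ A @ S_inv for A in ops]
h = [0, 1, 1, 0]
fa = evaluate(alpha_inf, ops, alpha_0, h)
fb = evaluate(S_inv.T @ alpha_inf, ops_B, S @ alpha_0, h)
assert np.isclose(fa, fb)                # identical evaluations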

Interpretable Model

Let $h_1, \ldots, h_m$ be realizations of each set $T_j \subset \Sigma^*$, $\forall j \in [m]$. Consider the operators defined on each subset $T_j$: $A_{T_j} = \sum_{t \in T_j} A_t$, such that $f(hT_j) = \sum_{t \in T_j} f(ht) = \alpha_\infty(A_{T_j}(f_h))$. We say the occurrences of each subset $T_i$, $\forall i \in [m]$, are characteristic events if the linear transformation $S \in \mathbb{R}^{m \times m}$,

$S = [\alpha_\infty^\top A_{T_1}, \ldots, \alpha_\infty^\top A_{T_m}]$,  (2.33)

is non-singular. We consider a model interpretable w.r.t. subsets $T_1, \ldots, T_m \subset \Sigma^*$ if the state of the model $\alpha_h$ has an interpretable representation, i.e., corresponds to evaluations of the function on the subsets:

$\alpha_h = [f(hT_1), \ldots, f(hT_m)]^\top \in \mathbb{R}^m$.  (2.34)

Then, for any minimal model such that $S$ is non-singular, it is possible to find an equivalent interpretable model $B = S A S^{-1}$ with states $\alpha'_h = S \alpha_h = [\alpha_\infty^\top A_{T_1} A_h \alpha_0, \ldots, \alpha_\infty^\top A_{T_m} A_h \alpha_0]^\top$.

Interpretability only requires mutually exclusive sets when the evaluation functional is $\alpha_\infty^\top = [1, 1, \ldots, 1]$, as in older interpretations of sequential models; then the sets $T_j$, $j \in [m]$, need to partition $\Sigma^\ell$ for sequences of length $\ell$. The characteristic events then correspond to a minimal number of observation sequences that do not overlap with each other, but still uniquely describe the system. In this particular setting, the finite-dimensional state vectors $b_h \in \mathbb{R}^K$ can be interpreted as distributions over the simplex of probabilities $\Delta^{K-1}$. Since characteristic

events are mutually exclusive, they form a basis to represent future predictions:

$b_\infty = \mathbf{1}_K$
$b_0 = p(t_i \mid *), \quad \forall t_i \in T_c$  (2.35)
$b_h = p(t_i \mid h) \in \mathbb{R}^K, \quad \forall h \in \Sigma^*,\, t_i \in T_c$  (2.36)
$B_{o_n} = p(o_n \mid h)\, p(t_i \mid h)^\top = \dfrac{p(o, h)}{p(h)} \in \mathbb{R}^{K \times K}, \quad \forall o \in \Sigma,\, t_i \in T_c$  (2.37)

In this case, the set of future predictions/state estimates is aligned with the characteristic events and we say the basis is interpretable; in general, however, interpretability only requires Eq. 2.34.

Hankel Matrix
Next, we define Hankel matrices (Carlyle et al., 1971; Fliess, 1974) and their duality to minimal sequential models. The Hankel matrix is defined as a bi-infinite matrix $H \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$, whose rows and columns are indexed by sequences in $\Sigma^*$, such that for all sequences $t, t', h, h' \in \Sigma^*$ with $ht = h't'$, we have $H(h, t) = H(h', t')$. The Hankel matrix encodes a characterizing function $f : \Sigma^* \to \mathbb{R}$. Hankel matrices and characterizing functions are algebraically related, such that there exists a duality between them. On the one hand, the Hankel matrix $H_f$ encodes $f$, such that its entries are given by $H_f(h, t) = f(ht)$. On the other hand, given a bi-infinite matrix $H$ with the Hankel property, there exists a unique function $f : \Sigma^* \to \mathbb{R}$ such that $H = H_f$ (Carlyle et al., 1971). A function $f : \Sigma^* \to \mathbb{R}$ can be realized by a sequential model if and only if $\mathrm{rank}(H_f)$ is finite, and in that case $\mathrm{rank}(H_f) = \mathrm{rank}(f)$ is the minimal number of states of any model $A$, also denoted its linear dimension (Carlyle et al., 1971; Fliess, 1974). In the dynamical systems community the Hankel matrix is also known as the system-dynamics matrix. Later, in Section 2.5.6, we will discuss learning algorithms for sequential models that make use of this duality by performing a spectral decomposition of the Hankel matrix, giving rise to the "spectral learning" algorithms.
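The duality above suggests the empirical recipe exploited by the spectral algorithms of Section 2.5.6: estimate a finite sub-block of $H_f$ from sequence counts and examine its rank. A toy sketch, estimating $f(ht)$ as the empirical probability of the complete string $ht$ over a corpus of sequences (an illustrative helper, under this interpretation of $f$):

import numpy as np
from collections import Counter

def empirical_hankel(seqs, histories, tests):
    # H(h, t) = f(ht): empirical probability of the concatenated string ht.
    counts = Counter(tuple(s) for s in seqs)
    total = sum(counts.values())
    H = np.zeros((len(histories), len(tests)))
    for i, h in enumerate(histories):
        for j, t in enumerate(tests):
            H[i, j] = counts[tuple(h) + tuple(t)] / total
    return H

# np.linalg.matrix_rank(H) then estimates the linear dimension of the system.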

2.5.2 Weighted Automata

Finite automata define a more general class of sequential models. They fall under the class of finite state machines (FSMs). These have been mostly developed in the context of formal language theory, to model the generation of strings. From the dynamical systems perspective, FSMs have been used to model the behaviour of a system that takes as input a sequence of observations and produces a sequence of outputs, by following a set of rules determined by the internal state of the system. The state admits only one out of a finite set of possible states, hence the name finite state machines. Finite-state transducers are the most general sequential model among FSMs, where both an input and an output label are associated with each transition. Weighted finite-state transducers (WFSTs) are finite-state transducers where each transition carries some weight in addition to the labels (Salomaa et al., 1978; Berstel et al., 1979). Weighted finite automata (WFA) can be regarded as a special case of weighted finite-state transducers, where the output label is omitted. WFSTs defined over the reals are closely related to IO-OOMs and PSRs (§2.5.4), while WFA are closely related to OOMs (§2.5.3). Much of the analysis of WFA and WFSTs can be described by the sequential-systems characterization (Balle et al., 2011). As specific cases of WFA, we have the family of probability distributions generated by HMMs and probabilistic non-deterministic finite automata (PNFA), and other non-probabilistic distributions. WFA can also be defined more generally over an arbitrary semiring instead of the field of real numbers, giving rise to the more general case of multiplicity automata (Beimel et al., 2000). Sequential models can be thought of as an instance of finite automata applied to sequential systems. In particular, the distributions over $\Sigma^*$ realized by WFA, also denoted rational stochastic languages, are a type of sequential model where $f(o) \geq 0$, $\sum_{o \in \Sigma^*} f(o) = 1$. This includes probabilistic automata with stopping probabilities (Dupont et al., 2005), hidden Markov models with absorbing states (Rabiner, 1989) and predictive state representations (Singh et al., 2004b). More details on spectral learning of WFA and FSMs are presented in Mohri (2009) and Balle (2013). A unified perspective of multiplicity automata, observable operator models and predictive state representations is provided by Thon et al. (2015).

2.5.3 Observable Operator Models (OOMs)

Observable Operator Models (OOMs) were introduced by Jaeger (2000) as an algebraic tool to model stochastic processes (Heller, 1965; Ito, 1992; Upper, 1997). In these models, the characterizing functions represent future prediction functions, meaning they define probabilities over sequences of observations, $f_t = p(o_{n+\ell:n})$, $\forall t \in \Sigma^*$. Here, $f_h : \Sigma^* \to \mathbb{R}$ defines a future prediction function that describes the distribution of future observations after a previous realization of a history $h$. In order to predict future observations conditioned on past events, $p(t \mid h) = p(o_{n+\ell}, \ldots, o_{n+1} \mid o_n, \ldots, o_1)$, we define normalized functions that correspond to conditional probabilities, $f_h(t)/f(h) = p(ht)/p(h) = p(t \mid h)$, $\forall h, t \in \Sigma^*$. Joint distributions can be considered as conditional probabilities conditioned on the empty symbol, $p(h) \equiv p(h \mid *)$. Therefore we can represent the system in terms of the set of all conditional continuation probabilities of the form $p(t \mid h)$, $\forall t, h \in \Sigma^*$, see Figure 2.8.

OOMs are typically defined over these conditional continuation probabilities, so for convenience we define OOMs as sequential models with normalized states. We define an Observable Operator Model (OOM) $A$ of dimension $K$ as a sequential model for a stochastic process. We say $f_A$ is a stochastic process if:

$f_A(*) = 1$,  (2.38)
$f_A(h) = \sum_{o \in \Sigma} f(oh) \quad \forall h \in \Sigma^*$.  (2.39)

OOMs define a sequential model with normalized states:

$\alpha_n = \dfrac{\alpha_h}{\alpha_\infty^\top \alpha_h} = \dfrac{A_h \alpha_0}{\alpha_\infty^\top A_h \alpha_0}$.  (2.40)

In this sense, OOMs are a subset of sequential models applied to stochastic systems (Eqs. 2.38-2.39); consequently they comply with all the properties derived for sequential models, in particular interpretability. The state-space dimensions of interpretable OOMs can be interpreted as probabilities of future predictions. Next, we discuss the connection between OOMs and HMMs.

Connection to Hidden Markov Models
Any HMM can be described by a tuple $\langle T, O, \pi, f^* \rangle$ (Section 2.3.1), where $T \in \mathbb{R}^{K \times K}$, $T_{m,j} = p(Z_n = z_m \mid Z_{n-1} = z_j)$, $\forall z_m, z_j \in [K]$, denotes its state transition matrix; $\pi \in \mathbb{R}^K$, $\pi_j = p(Z_0 = z_j)$, its initial state distribution; $f^* \in \mathbb{R}^K$, $f^*_j = p(\mathrm{STOP} \mid Z_\ell = z_j)$, its final state distribution; and $O \in \mathbb{R}^{d \times K}$, $O_{i,j} = p(O_n = o_i \mid Z_n = z_j)$, its observation matrix, giving the probability of state $z_j \in [K]$ generating an observation $o_i \in \Sigma$. We further write $O_i = p(O_n = o_i \mid Z_n) \in \mathbb{R}^K$ for the rows of the full observation matrix.

HMMs can as well be described as sequential models, such that the state vector describes the belief (probability distribution) of the underlying hidden process. The sequential-model representation of an HMM is equivalent to an interpretable OOM:

$\mathrm{HMM} \equiv \mathrm{OOM} = \langle\, \alpha_\infty = f^*,\; \{A_{o_i} = T\, \mathrm{diag}(O_i)\}_{o_i \in \Sigma},\; \alpha_0 = \pi \,\rangle$  (2.41)

Figure 2.9: Probability clock of a 3-dimensional OOM. Green dots represent states; each linear operator $A_o$ transforms one point into the next by a rotation. HMMs would require an infinite number of states to model the same example exactly.

One of the main differences between HMMs and OOMs is the state representation: in the former, the state describes the probability distribution of the process being in a given hidden state, so state dimensions must be non-negative and sum to one together with the stopping symbol; in the latter, the state is simply a representation of future predictions in a given basis, where states and observable operators may be negative, meaning they need only obey the constraints in Eqs. (2.38-2.39). This fact makes OOMs a more general class of models. There exist finite-dimensional OOMs that cannot be described by any finite-dimensional HMM, such as the probability clock example (Jaeger, 2000). Figure 2.9 shows how OOM operators that correspond to rotations in the ABC-plane cannot be modeled by any finite-dimensional HMM: rotation operators need negative entries, which contradicts the definition of HMM stochastic matrices. Conversion of an OOM into the closest HMM is still an open research problem. Nevertheless, Hsu et al. (2009) and Anandkumar et al. (2012a) provide alternative approaches to recover the parameters of an HMM $\langle T, O, \pi, f^* \rangle$ from an equivalent OOM. We look into these problems in detail in Section 2.5.6 and Section 2.4.3.
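Eq. 2.41 translates directly into code; a small sketch constructing the interpretable OOM operators $A_o = T\,\mathrm{diag}(O_o)$ from given HMM parameters and evaluating a terminated sequence probability (helper names are our own):

import numpy as np

def hmm_to_oom(T, O, pi, f_stop):
    # Interpretable OOM of Eq. 2.41: alpha_inf = f*, A_o = T diag(O_o),
    # alpha_0 = pi. O[o, j] = p(o | z_j); rows of O index observations.
    ops = {o: T @ np.diag(O[o]) for o in range(O.shape[0])}
    return f_stop, ops, pi

def sequence_prob(alpha_inf, ops, alpha_0, obs):
    # p(o_1, ..., o_n, stop) = alpha_inf^T A_{o_n} ... A_{o_1} alpha_0.
    state = alpha_0
    for o in obs:
        state = ops[o] @ state
    return alpha_inf @ state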

2.5.4 Predictive State Representations and Controlled Processes

Several sequential models have been proposed to handle dynamical systems that can be controlled by an external input at each time step, which we denote an action. Jaeger (1999) proposed input-output OOMs (IO-OOMs) as a generalization of linear operator models, where each transformation is indexed by a pair of finite observation and action, $ao \equiv (a, o)$, $\forall o \in \Sigma_O, a \in \Sigma_I$, such that $\Sigma = \Sigma_I \oplus \Sigma_O$, and sequences now refer to sequences of pairs of observations and executed actions, $h = (ao_n, ao_{n-1}, \ldots, ao_1)$, $\forall h \in \Sigma^*$. Littman et al. (2001), Rosencrantz et al. (2004a), and Boots et al. (2011a) have also proposed models, from a dynamical-systems perspective, for controlled processes. They denoted this family of models predictive state representations (PSRs). We model future prediction functions through a $K$-dimensional state vector of an implicit process $(Z_n)$.

The tuple $\langle \mathbb{R}^K, \tau_\infty, \{A_{oa}\}_{o \in \Sigma_O, a \in \Sigma_I}, \tau_0 \rangle$ defines an input-output OOM (PSR) of the process $(O_n)$. IO-OOMs are sequential models for stochastic processes such that, for any given function $f : \Sigma^* \to [0,1]$:

$f(*) = 1$,  (2.42)
$f(h) = \sum_{o \in \Sigma_O} f(hao), \quad \forall a \in \Sigma_I$,  (2.43)
$f(t \mid h) = f(ht)/f(h), \quad \text{for } f(h) \neq 0$  (2.44)

Here, $f$ may describe a joint probability function $p(o_{n:1}, a_{n:1})$ or conditional probabilities of observations conditioned on actions, $p(o_{n:1} \mid a_{n:1})$. PSRs describe the same process an IO-OOM describes, but in terms of an implicit state representation, which can be more expressive than probabilities of tests conditioned on histories: PSRs can describe expected values of arbitrary statistics of the future. This representation provides a generalization of partially observable Markov decision processes (POMDPs), in the same way OOMs provide a generalization of HMMs.

Let $\mathcal{T} = \Sigma^*$ be the set of all possible sequences of future observations/actions, and $\mathcal{H} = \Sigma^*$ the set of all possible past observations/actions.

We denote by $P_{\mathcal{T},\mathcal{H}} \in \mathbb{R}^{|\mathcal{T}| \times |\mathcal{H}|}$ the bi-infinite-dimensional system-dynamics matrix. This matrix contains the probabilities of all sequences in the vocabulary $\Sigma^*$, such that $[P_{\mathcal{T},\mathcal{H}}]_{t,h} = f(th)$; it uniquely characterizes the stochastic process, since it contains all possible sequence probabilities. If we restrict to a finite representation of tests and histories, $\mathcal{T}, \mathcal{H} \subset \Sigma^*$, with $\mathrm{rank}(P_{\mathcal{T},\mathcal{H}}) = \dim(O_n) = \mathrm{rank}(\mathcal{F})$, then the sets of tests and histories are called core. In particular, when the number of tests and histories in the sets is $|\mathcal{H}| = |\mathcal{T}| = \dim(O_n)$, these sets are minimal core sets. The rank of the system-dynamics matrix corresponds to the dimension of the process. Given a set of core tests, we may define a finite-dimensional sequential model, also known as a PSR (Singh et al., 2004b; Littman et al., 2001). PSRs define a sequential model $\langle \mathbb{R}^K, \alpha_\infty, \{A_{ao}\}_{o \in \Sigma_O, a \in \Sigma_I}, \alpha_0 \rangle$, such that for any history $h \in \Sigma^*$:

$f(t \mid h) = \dfrac{\alpha_\infty^\top A_t A_h \alpha_0}{\alpha_\infty^\top A_h \alpha_0} = m_t \alpha_h$,  (2.45)

$m_t = \alpha_\infty^\top A_t$  (2.46)

$\alpha_h = A_h \alpha_0 / \alpha_\infty^\top A_h \alpha_0$  (2.47)

where $m_t$ defines a projection function, and $\alpha_h$ defines a predictive state. Originally, PSRs were derived in the interpretable setting w.r.t. a fixed set of core tests $\{q_1, \ldots, q_K\}$, $q_i \in \Sigma^*$ (Littman et al., 2001). In this case, the state of the PSR consists of predictions of normalized core tests:

$f(t \mid h) = m_t(\alpha_h)$,  (2.48)
$\alpha_h = [f(q_1 \mid h), f(q_2 \mid h), \ldots, f(q_K \mid h)]^\top \quad \forall h \in \Sigma^*$  (2.49)

The state $\alpha_h$ is the state of the equivalent normalized sequential model, represented in an interpretable basis w.r.t. the core sets. The $m_t$ define projection functions that linearly project any arbitrary test onto a linear combination of the core tests, i.e., they describe projection functions onto the interpretable basis of core sets, which is equivalent to setting:

$A_{ao} = [m_{oq_1}^\top, \ldots, m_{oq_K}^\top]^\top \quad \forall o \in \Sigma_O$,  (2.50)

$\alpha_\infty = \sum_{o \in \Sigma_O} m_{ao} \quad \forall a \in \Sigma_I$,  (2.51)

$m_t = \alpha_\infty^\top A_t$  (2.52)

Finding minimal core sets is in many cases a difficult task, also known as the discovery problem. Work of Rosencrantz et al. (2004a) provides an alternative form of obtaining a low-dimensional representation of the predictive state and projection functions. This process gives rise to the so-called transformed PSRs (TPSRs), and it involves computing a spectral decomposition of moments of a large number of tests and histories, such that an equivalent transformation is obtained, as in Eq. 2.31.

2.5.5 Hilbert Space Predictive State Representations

Additional work on PSRs extended sequential models to kernel domains with continuous observations, using Hilbert space embeddings to define distributions over observations and latent states (Smola et al., 2007a). These methods were initially proposed to perform inference in an HMM by updating points in a Hilbert space, using covariance operators (Fukumizu et al., 2013a) and conditional operators (Song et al., 2009). Next, we describe the main concepts behind this approach.

We assume that future prediction functions are elements in a Hilbert space of functions $\mathcal{R}_\mathcal{T}$, endowed with an inner product. We consider $\mathcal{R}_\mathcal{H}$ to be the analogous space of past prediction functions. Now let us define a feature mapping of the past, $\phi : \mathcal{H} \to \mathbb{R}^H$, as a function that represents histories $h$ as $H$-dimensional vectors; similarly for features of future tests, $\psi : \mathcal{T} \to \mathbb{R}^F$, and features of skipped-future observations, $\xi : \mathcal{T} \to \mathbb{R}^F$, where we skip the current observation, see Figure 2.10.

Figure 2.10: Tests and histories defining a PSR.

Feature histories form a basis of a Hilbert space, where we assume $\mathcal{R}_\mathcal{H} = \mathrm{span}\{\phi(f_{h_i})\}_{i=1}^N$, as in Eq. 2.24. We let $\mathcal{R}_\mathcal{H}$ denote the space of past, $\mathcal{R}_\mathcal{F}$ of future, and $\mathcal{R}_{\mathcal{SF}}$ of skipped-future functions, respectively. The inner product in this space is given in terms of a kernel of the form $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{R}_\mathcal{H}}$. We additionally assume certain continuity constraints: a point in $\mathcal{R}_\mathcal{H}$ corresponds to a function $f$, and evaluations can be translated into inner products $f(x) = \langle f(\cdot), \phi(x) \rangle_{\mathcal{R}_\mathcal{H}}$ with the functions $\phi$ that span the space.⁸ This special reproducing property makes this Hilbert space a Reproducing Kernel Hilbert Space (RKHS) (Aronszajn, 1950; Shawe-Taylor et al., 2004). Then $\mathcal{R}_\mathcal{H}$ defines the RKHS of past, $\mathcal{R}_\mathcal{F}$ the RKHS of future, and $\mathcal{R}_{\mathcal{SF}}$ the RKHS of skipped-future functions.

⁸ The functions that generate this space are also commonly written as $k_{\mathcal{R}_\mathcal{H}}(x, \cdot)$.

In this setting, probability distributions of histories $p(h)$, $\forall h \in \mathcal{H}$, can be embedded in this space as points describing expected feature values, also called mean maps:

$\mu_\mathcal{H} = \mathbb{E}_{h \sim p}[\phi(h)] \in \mathbb{R}^H$  (2.53)

By the reproducing property, the inner product of any function in $\mathcal{R}_\mathcal{H}$ with the mean map yields the expected value of its function evaluation:

$\mathbb{E}_{h \sim p}[f(h)] = \langle f(\cdot), \mu_\mathcal{H} \rangle_{\mathcal{R}_\mathcal{H}}$.  (2.54)

The same is true for joint distributions, where the analogous mapping is done via covariance operators (Fukumizu et al., 2013b).

Let⁹ $\mathcal{R}_\mathcal{F} \otimes \mathcal{R}_\mathcal{H}$ denote the RKHS whose feature map is defined through the tensor product $\varphi(t, h) = \psi(t) \otimes \phi(h)$. This space is an RKHS in itself, where points in the space embed joint probability distributions via covariance operators $\mathcal{C}_{\mathcal{T},\mathcal{H}} \in \mathbb{R}^{F \times H}$:

⁹ $x \otimes y$ denotes the outer product; for vectors it becomes $x y^\top$.

$\mathcal{C}_{\mathcal{T},\mathcal{H}} = \langle p_{\mathcal{T},\mathcal{H}}(\cdot), \varphi \rangle_{\mathcal{R}_\mathcal{F} \times \mathcal{R}_\mathcal{H}} = \mathbb{E}_{\mathcal{T},\mathcal{H}}[\psi(\mathcal{T}) \otimes \phi(\mathcal{H})]$  (2.55)

In the case of delta kernels under discrete events, also known as the "tabular setting", $\mathcal{C}_{\mathcal{T},\mathcal{H}}$ coincides with the joint probability table $p(\mathcal{T}, \mathcal{H})$.

Conceptually, conditional distributions $p(\mathcal{T} \mid h)$ can be embedded in an RKHS in a similar form, using a mean map function $\mu_{\mathcal{T} \mid h} = \mathbb{E}_{\mathcal{T} \mid h}[\psi(\mathcal{T}) \mid h]$. This defines a family of conditional embeddings, considering all possible values of $h \in \mathcal{H}$. In the tabular setting, i.e., for delta kernels under discrete events, $\mu_{\mathcal{T} \mid h}$ coincides with the conditional probability table $p(\mathcal{T} \mid \mathcal{H} = h)$. Consider now a mapping between different RKHSs, defined in terms of a conditional operator $\mathcal{C}_{\mathcal{T} \mid \mathcal{H}} : \mathcal{R}_\mathcal{H} \to \mathcal{R}_\mathcal{F}$, such that

$\mathcal{C}_{\mathcal{T} \mid \mathcal{H}} = \mathbb{E}[\psi(\mathcal{T}) \mid \mathcal{H} = \cdot\,]$  (2.56)
$\mu_{\mathcal{T} \mid h} = \mathcal{C}_{\mathcal{T} \mid \mathcal{H}}\, \phi(h) = \mathcal{C}_{\mathcal{T},\mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}^{-1}\, \phi(h)$  (2.57)

This establishes the connection between the two RKHSs.¹⁰ Intuitively, conditional operators map probability distribution embeddings into other probability distribution embeddings. Using this notion, we may establish the relation between mean maps:

¹⁰ This estimator is also commonly used in the regularized Tikhonov variant: $\mu_{\mathcal{T} \mid h} = \mathcal{C}_{\mathcal{T},\mathcal{H}} \left( \mathcal{C}_{\mathcal{H},\mathcal{H}} + \lambda I \right)^{-1} \phi(h)$.

$\mu_\mathcal{T} = \mathcal{C}_{\mathcal{T} \mid \mathcal{H}}\, \mu_\mathcal{H}$, and  (2.58)
$\mathcal{C}_{\mathcal{T},\mathcal{H}} = \mathcal{C}_{\mathcal{T} \mid \mathcal{H}}\, \mathcal{C}_{\mathcal{H},\mathcal{H}}$  (2.59)

In fact, this is what observable operators do in the original spaces of distributions. Mean maps, on the other hand, are equivalent to distributions, in particular future prediction functions. Hence we can define PSRs embedded in RKHSs using solely conditional operators and mean maps.

Consider a PSR with predictive states in $\mathcal{Z}$. Let $\mathcal{R}_\mathcal{Z}$ be the RKHS of predictive states with kernel $l(z, z') = \langle \zeta(z), \zeta(z') \rangle_{\mathcal{R}_\mathcal{Z}}$, $\forall z \in \mathcal{Z}$, and let $\mu_{\mathcal{T} \mid h}$ be the mean map for the probability density of future predictions $p(\mathcal{T} \mid h)$. The observable operator $A_o : \mathcal{R}_\mathcal{Z} \times \mathcal{R}_\mathcal{F} \to \mathcal{R}_\mathcal{Z} \times \mathcal{R}_\mathcal{F}$ transforms future prediction functions into other prediction functions:¹¹

$\mu_{\mathcal{Z}_{n+1} \mid h_{n+1}} = A_{o_n}\, \mu_{\mathcal{Z}_n \mid h_n}$.  (2.60)

¹¹ The mean map for future predictions with RKHS $\mathcal{R}_\mathcal{Z}$ for the state space satisfies $\mu_{\mathcal{T}_n \mid h_n} = \mathbb{E}_{\mathcal{Z}_n \mid h_n}[\mathbb{E}_{\mathcal{T}_n \mid \mathcal{Z}_n}[\psi(\mathcal{T}_n)]] = \mathcal{C}_{\mathcal{T}_n \mid \mathcal{Z}_n}\, \mu_{\mathcal{Z}_n \mid h_n} = \mathcal{C}_{\mathcal{T}_n \mid \mathcal{Z}_n}\, A_{o_{n-1}}\, \mu_{\mathcal{Z}_{n-1} \mid h_{n-1}}$.

We define PSRs in RKHSs following Eqs. (2.45-2.47):

$\mu_{\mathcal{T}_n \mid h_n} = \mathcal{C}_{\mathcal{T}_n \mid \mathcal{Z}_n}\, A_{h_n}\, \alpha_0$  (2.61)
$\alpha_0 = \mathcal{C}_{\mathcal{Z}_0 \mid \mathcal{H}_0}\, \mu_{\mathcal{H}_0}$, with $\alpha_0 : \mathcal{R}_\mathcal{H} \to \mathcal{R}_\mathcal{Z}$  (2.62)
$\alpha_n = \mu_{\mathcal{Z}_n \mid h_n} = A_{o_{n-1}}\, \mu_{\mathcal{Z}_{n-1} \mid h_{n-1}} = A_h\, \alpha_0$, with $\alpha_n : \mathcal{R}_\mathcal{Z} \to \mathcal{R}_\mathcal{Z}$  (2.63)
$\alpha_\infty(f) = \langle f A_t, \alpha_n \rangle_{\mathcal{R}_\mathcal{Z}}$, $\forall f \in \mathcal{R}_\mathcal{F}$, with $\alpha_\infty : \mathcal{R}_\mathcal{Z} \to \mathbb{R}$  (2.64)

Evaluations in the RKHS may be done by computing expectations of future predictions, via mean-map inner products in the RKHS. For any mean map embedding $f \in \mathcal{R}_\mathcal{F}$:¹²

$\mathbb{E}_{\mathcal{T} \mid \mathcal{H}}[f(t_n \mid h_n)] = \langle f, \mu_{\mathcal{T}_n \mid h_n} \rangle_{\mathcal{R}_\mathcal{F}} = \langle f A_t, \alpha_n \rangle_{\mathcal{R}_\mathcal{Z}}$  (2.65)

¹² Instead of the standard form of evaluation of sequential models, $f(t \mid h) = \alpha_\infty^\top A_t A_h \alpha_0$.
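In practice, the conditional operator of Eq. 2.57 is estimated from paired samples via the regularized (Tikhonov) variant in the margin note; a finite-dimensional feature sketch (illustrative names, not the cited authors' code):

import numpy as np

def conditional_operator(Psi, Phi, lam=1e-3):
    # Estimate C_{T|H} ~= C_{T,H} (C_{H,H} + lam I)^{-1} from paired samples:
    # Psi (F x N test features), Phi (H x N history features).
    N = Phi.shape[1]
    C_TH = Psi @ Phi.T / N                     # empirical covariance operator
    C_HH = Phi @ Phi.T / N
    reg = C_HH + lam * np.eye(C_HH.shape[0])   # Tikhonov regularization
    return C_TH @ np.linalg.inv(reg)

# The conditional mean map for a new history h is then mu_{T|h} = C @ phi(h).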

2.5.6 Spectral learning of PSRs

Moment-based approaches rely at their core on a spectral decomposition of observed statistics, and are commonly denoted spectral learning. Spectral learning has been used to estimate different latent variable models, such as HMMs (Hsu et al., 2012a), GMMs (Anandkumar et al., 2012a), PCFGs (Cohen et al., 2012), additive trees (Parikh et al., 2014) and finite state transducers (Bailly et al., 2013a). Singular value decomposition (SVD), canonical correlation analysis (CCA) and reduced-rank regression (RRR) are common spectral decompositions for matrices. They are used to retrieve a latent state representation from correlations of past and future observations. Higher-order statistics use tensor decomposition techniques to achieve the same purpose (§2.4.3). The first such learning methods used predictive state representations (PSRs), proposed by Littman et al. (2001).

Consider the first-order moment $W_1 = \mathbb{E}[\phi(h)] \in \mathbb{R}^{|\mathcal{H}|}$ of histories, $\forall h \in \mathcal{H}$, and some feature mapping for histories $\phi : \mathcal{H} \to \mathbb{R}^H$. The second-order moment matrix can be defined for the conditional expectation of features of tests, $\psi : \mathcal{T} \to \mathbb{R}^F$, conditioned on histories $h \in \mathcal{H}$, as $W_2 = \mathbb{E}[\psi(t) \mid \phi(h)] \in \mathbb{R}^{F \times H}$, for a large, non-minimal number of tests $|\mathcal{T}|$ and histories $|\mathcal{H}|$.¹³ Let the third-order moment of tests, current observation/action pairs and histories, $W_3 = \mathbb{E}[\psi(t) \otimes \varphi(o) \mid \phi(h); \varphi(a)]$, be written as matrix slices in $\mathbb{R}^{F \times H}$ for each possible observation/action pair, $\forall o \in \Sigma, a \in A$. Let $\cdot$ denote the indices of all elements and $*$ refer to the index of the empty string/history. Hsu et al. (2009) and Boots et al. (2011a) show that there is a conjugate equivalent PSR whose parameters can be obtained from these observable quantities using

¹³ This conditional moment refers to a conditional embedding operator defined by the Kernel Bayes Rule; see Fukumizu et al. (2013c) and Song et al. (2009) for details.

$\alpha_0 = U^\top [W_2]_{\cdot,*}$  (2.66)
$\alpha_\infty^\top = W_1^\top \left( U^\top W_2 \right)^\dagger$  (2.67)
$A_{a,o} = U^\top [W_3]_{\cdot,ao,\cdot} \left( U^\top W_2 \right)^\dagger$  (2.68)

where $U$ is the projection matrix defining the learned subspace, typically obtained from a spectral decomposition of the second-moment matrix $W_2 = U \Sigma V^\top$.¹⁴ Here $M^\dagger$ denotes the Moore-Penrose pseudo-inverse of $M$. The solution in Eq. 2.66 can be obtained from solving a least-squares optimization of the form:

¹⁴ For core identification methods $U$ is the identity matrix and the moments specify probabilities over core tests instead.

$\arg\min_{A_{a,o}} \left\| U^\top [W_3]_{\cdot,ao,\cdot} - A_{a,o}\, U^\top W_2 \right\|_F^2$  (2.69)

If the feature mappings correspond to indicator functions, then these moments correspond to conditional probability tables and the PSR parameters are defined in the tabular case. However, Eq. 2.66 provides an extended description that is able to handle continuous observations and actions as embeddings in an RKHS. Conditional moments in this continuous setting are referred to as conditional embedding operators; for a detailed description, see Song et al. (2009), Fukumizu et al. (2013c), and Boots et al. (2013b).

The proposed method is consistent, in the sense that the PSR parameters learned from data recover the process exactly under an infinite data assumption. This method has been superseded by better methods, such as the ones we describe later. Work of Singh et al. (2004b) extended the learning process to transformed PSRs (TPSRs). They presented a method to learn lower-dimensional PSRs. Whenever the set of core tests is not minimal, the dimension of the PSR is higher than the linear dimension of the system. This work allowed dealing with larger core sets of tests and histories, by finding a transformed PSR of lower dimension that still generates the process. Later, Siddiqi et al. (2010b) proposed learning reduced-rank HMMs (RR-HMMs). Some HMMs require a large number of hidden states to smoothly model the evolution of the process. RR-HMMs provide a way to model the same process by projecting onto a lower-dimensional state space, which makes the learning problem easier. For a particular choice of core tests we will have a different, but equivalent, PSR. However, finding sets of core tests can be a difficult task, the so-called discovery problem (Singh et al., 2003). Kulesza et al. (2015) provide a form of selecting these sets such that the learning algorithm recovers a better-conditioned PSR.

Spectral learning of Hidden Markov Models

The most common algorithms for learning HMMs resort to search heuristics that maximize the likelihood of the observed data, such as EM (Baum et al., 1970; Dempster et al., 1977), described in Section 2.4.1. These methods, however, suffer from local optima and require several inference passes over the data, which makes them slower than spectral methods. The first algorithm for learning PSRs for Hidden Markov Models (HMMs) was proposed by Hsu et al. (2009), who derived a learning algorithm that obtains PSRs from a spectral decomposition and provided PAC-bound guarantees for learning the observable operators from data. The idea comes from subspace identification methods in control theory, which use spectral approaches to discover the relationship between hidden states and observations (Ljung, 1999; Van Overschee et al., 1996).

The algorithm involves projecting observations into a lower-dimensional space that correlates past and future observations, using a Singular Value Decomposition (SVD), $\mathrm{SVD}(P_{\mathcal{T},\mathcal{H}}) = U \Lambda V^\top$, and keeping the first $k$ left singular vectors, $U \equiv U_k \in \mathbb{R}^{|\mathcal{T}| \times k}$:

$$b_0 = U^\top P_{\mathcal{T},e} \in \mathbb{R}^k \quad (2.70)$$
$$b_\infty^\top = P_{\mathcal{H},e}^\top \big(U^\top P_{\mathcal{T},\mathcal{H}}\big)^\dagger \in \mathbb{R}^k$$
$$B_o = \big(U^\top P_{\mathcal{T},o,\mathcal{H}}\big)\big(U^\top P_{\mathcal{T},\mathcal{H}}\big)^\dagger \in \mathbb{R}^{k \times k}$$

The matrix $U$ only needs to satisfy $\mathrm{range}(U) = \mathrm{range}(P_{\mathcal{T},\mathcal{H}}) = \mathrm{range}(O)$, such that $(U^\top O)$ is invertible. All the quantities can be computed from data, and they relate to the parameters of the HMM through a similarity transform $S = (U^\top O) \in \mathbb{R}^{k \times k}$. The equations above define an equivalent sequential model with observable operators $A_o = T\,\mathrm{diag}(O_x)$, where $\mathrm{diag}(O_x) \in \mathbb{R}^{k \times k}$ denotes the diagonal matrix containing the $x$-th row of the emission matrix, i.e., all emission probabilities for observation $x$. The results in Eq. 2.70 correspond to the solution of an equivalent regression problem (Eq. 2.69), using a projection onto the column space of $U$:

$$\arg\min_{B_o} \big\| U^\top P_{\mathcal{T},o,\mathcal{H}} - B_o\, U^\top P_{\mathcal{T},\mathcal{H}} \big\|_F^2 \quad (2.71)$$

This method provides PAC-bound guarantees under certain spectral separation assumptions on the model. It is also more sample-efficient than previous work (Mossel et al., 2006), since it does not attempt to explicitly retrieve the emissions and transitions. Nevertheless, Hsu et al. (2009) provide a possible way to retrieve the parameters $T, O$: first estimate the emission matrix, then the transition matrix, making use of pseudo-inverses. This recovery process is, however, unstable at its core, since it makes extensive use of pseudo-inverses and inverses, which makes the algorithm very sensitive to noise; in the case of insufficient statistics or inexact moments, this can be a fatal problem. Other spectral techniques have been employed to find a lower-dimensional state space (the projection $U$): Canonical Correlation Analysis (CCA) has been used to find directions of maximum correlation between two views, past and future (Cohen et al., 2013b). Song et al. (2009) present a kernel version for learning observable operators that does not require the explicit computation of features, which could have a potentially infinite-dimensional basis (such as RBF or exponential kernels). Instead, they consider the dual problem and derive the equations in terms of Gram matrices.
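To make Eqs. 2.70-2.71 concrete, the following NumPy sketch assembles the observable operators from empirical probability tables. It is an illustration under simplifying assumptions (the moments are given as dense arrays; names such as spectral_hmm and P_ToH are ours, not from the original papers).

import numpy as np

def spectral_hmm(P_T_e, P_H_e, P_TH, P_ToH, k):
    """Minimal sketch of spectral HMM learning (Eq. 2.70).

    P_T_e:  |T| vector of empirical test probabilities from the empty history
    P_H_e:  |H| vector of empirical history probabilities
    P_TH:   |T| x |H| test/history co-occurrence probability matrix
    P_ToH:  dict mapping each observation o to its |T| x |H| slice P_{T,o,H}
    k:      target rank (assumed number of states)
    """
    # Project onto the top-k left singular vectors of the second moment.
    U, _, _ = np.linalg.svd(P_TH, full_matrices=False)
    U = U[:, :k]
    UP = U.T @ P_TH                       # k x |H|
    UP_pinv = np.linalg.pinv(UP)          # |H| x k
    b0 = U.T @ P_T_e                      # initial weights (R^k)
    b_inf = UP_pinv.T @ P_H_e             # normalizer: b_inf^T = P_{H,e}^T (U^T P_{T,H})^+
    B = {o: (U.T @ P_o) @ UP_pinv for o, P_o in P_ToH.items()}  # observable operators
    return b0, b_inf, B

def sequence_probability(b0, b_inf, B, observations):
    """Pr(x_1..x_L) = b_inf^T B_{x_L} ... B_{x_1} b_0."""
    state = b0
    for o in observations:
        state = B[o] @ state
    return float(b_inf @ state)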

Spectral learning of HSE-PSRs

Previously we have defined linear operators and learning algorithms in terms of discrete action/observation pairs. This formulation can become infeasible for large-scale systems where it is difficult to enumerate all possible observations and actions, such as in robotics domains where observations and actions are continuous. Boots et al. (2013b) extended spectral learning of embedded distributions in RKHSs to predictive representations of controlled models, also known as Hilbert space embedding PSRs (HSE-PSRs). Let $\mathcal{T}^O$ denote the set of tests of observations, $\mathcal{T}^A$ the set of test actions, $\mathcal{O}$ the set of observations, and $\mathcal{A}$ the set of actions. We further define feature mappings, shown in Figure 2.11, for immediate, future, and extended future actions and observations:

• $\phi^o_t(o_t): \mathcal{O} \to \mathbb{R}^O$ of immediate observations,
• $\phi^a_t(a_t): \mathcal{A} \to \mathbb{R}^A$ of immediate actions,
• $\psi^o_t(o_{t:t+k-1}): \mathcal{O}^k \to \mathbb{R}^{F_O}$ of future observations,
• $\psi^a_t(a_{t:t+k-1}): \mathcal{A}^k \to \mathbb{R}^{F_A}$ of future actions,
• $\xi^o_t(o_{t:t+k}) \equiv [\psi^o_{t+1} \otimes \phi^o_t,\ \phi^o_t \otimes \phi^o_t]: \mathcal{O}^{k+1} \to \mathbb{R}^{(F_O + O) \times O}$ of extended observations,
• $\xi^a_t(a_{t:t+k}) \equiv \psi^a_{t+1} \otimes \phi^a_t: \mathcal{A}^{k+1} \to \mathbb{R}^{F_A \times A}$ of extended actions.

In HSE-PSRs the predictive state $q_t$ is updated using covariance operators:

$$q_{t+1} = \mathcal{C}_{\mathcal{T}^O \mid \mathcal{T}^A, h_t, a_t, o_t} \quad (2.72)$$
$$\phantom{q_{t+1}} = \mathcal{C}_{\mathcal{T}^O, \mathcal{O} \mid \mathcal{T}^A, h_t, a_t} \otimes \mathcal{C}^{-1}_{\mathcal{O}, \mathcal{O} \mid h_t, a_t}\, \phi(o_t)$$
$$\phantom{q_{t+1}} = \big(W^\xi_{\text{ext}}\, q_t \otimes \phi^a(a_t)\big) \otimes \big(W^o_{\text{ext}}\, q_t \otimes \phi^a(a_t) \otimes \phi^o(o_t)\big)^{-1},$$

where $W^\xi_{\text{ext}}$ is a linear transformation from futures to skipped futures and $W^o_{\text{ext}}$ a linear transformation from futures to the current observation, as illustrated in Figure 2.11 (the PSR update step).

Two-stage Regression

In this section, we revisit a continuous representation for the filtering and prediction equations based on the two-stage regression approach proposed by Hefny et al. (2017b). This method interprets the learning of PSRs, and in particular of HSE-PSRs, as an instrumental-variable regression algorithm. It assumes that extended future expectations $\bar{p}_t = \mathbb{E}[\xi^o_t(o_{t:t+k});\ \xi^a_t(a_{t:t+k})]$ are a linear transformation of future expectations $\bar{q}_t = \mathbb{E}[\psi^o_t(o_{t:t+k-1});\ \psi^a_t(a_{t:t+k-1})]$ (see Figure 2.11). Due to the temporal correlation between the two expectations $\bar{p}_t$ and $\bar{q}_t$, we cannot directly learn the mapping $W_{\text{ext}}$ from empirical estimates of $\bar{p}_t$ and $\bar{q}_t$ using least squares, since their noise is also correlated. Instead, they turn to an instrumental-variable regression where history is used as the instrument, since it is correlated with the observables $\psi_t$ and $\xi_t$, but not with the noise. In this case, we can go from predictive to extended states by first computing a possibly non-linear regression from histories to both predictive (Eq. 2.73) and extended statistics (Eq. 2.74) (stage-1 regressions a and b). When we consider features of a Hilbert Space Embedding PSR (HSE-PSR), this non-linear regression can be computed by the Kernel Bayes Rule (KBR) (Fukumizu et al., 2013b), but it could be any arbitrary non-linear transformation (regression trees, …). Predictive states can be regarded as evaluations of conditional mean maps $m_{\mathcal{T}|\mathcal{H}}$, where $p_t = \langle \xi_{t+1}, m_{\mathcal{T}|\mathcal{H}} \rangle_{\mathcal{R}_F}$ in the RKHS of future sequences $\mathcal{R}_F$, and analogously for future actions and extended states.

$$p_t \equiv \mathbb{E}\big[\xi^o_{t+1} \mid h^\infty_t;\ \xi^a_{t+1}\big] \qquad \text{stage-1a} \quad (2.73)$$
$$q_t \equiv \mathbb{E}\big[\psi^o_t \mid h^\infty_t;\ \psi^a_t\big] \qquad\ \text{stage-1b} \quad (2.74)$$

Subsequently, this approach linearly regresses the denoised extended state $p_t$ from the denoised predictive state $q_t$, using a least squares approach (stage-2 regression), in Eq. 2.75. For ease of explanation, let us further partition extended states into two parts, $p_t \equiv [p^\xi_t, p^o_t]$ and $W_{\text{ext}} \equiv [W^\xi_{\text{ext}}, W^o_{\text{ext}}]$, derived from the skipped future $\psi^o_{t+1}$ and the present observation $\phi^o_t$; see Figure 2.11.

$$p^\xi_t \equiv \mathbb{E}\big[\psi^o_{t+1} \otimes \phi^o_t \mid h^\infty_t;\ \psi^a_{t+1} \otimes \phi^a_t\big]$$
$$p^o_t \equiv \mathbb{E}\big[\phi^o_t \otimes \phi^o_t \mid h^\infty_t;\ \phi^a_t\big]$$
$$p_t = [p^\xi_t, p^o_t]^\top = [W^\xi_{\text{ext}}, W^o_{\text{ext}}]^\top q_t \qquad \text{stage-2} \quad (2.75)$$

For RFF-PSRs, stage-1 regression can be derived either from a joint regression over action/observation pairs using KBR, or by solving a regularized least squares problem; for a full derivation refer to Hefny et al. (2017b). The second step, in Eq. 2.76, is provided by a filtering function $f_t$ that tracks predictive states over time:

$$q_{t+1} = f_{\text{cond}}(W_{\text{ext}}\, q_t, o_t, a_t) \equiv f_t(q_t, o_t, a_t) \quad (2.76)$$

Namely, for HSE-PSRs filtering corresponds to conditioning on the current observation via KBR, as in Eq. 2.77. In contrast with predictive state inference machines, which learn a filtering function directly by minimizing prediction loss (Sun et al., 2016b), here we assume a generative approach and learn the filtering function parameters subject to the generative HSE-PSR framework. We consider $p^\xi_t$ as a 4-mode tensor over $(\psi^o_{t+1}, \phi^o_t, \psi^a_{t+1}, \phi^a_t)$, where $\times_{\phi^o}$ denotes multiplication along the $\phi^o$-mode.

$$q_{t+1} = \mathbb{E}\big[\psi^o_{t+1} \mid h^\infty_{t+1};\ \psi^a_{t+1}\big] \quad (2.77)$$
$$= \mathbb{E}\big[\psi^o_{t+1} \mid \phi^o_t, h^\infty_t;\ \psi^a_{t+1}, \phi^a_t\big]$$
$$= \mathbb{E}\big[\psi^o_{t+1} \otimes \phi^o_t \mid h^\infty_t;\ \psi^a_{t+1} \otimes \phi^a_t\big] \times_{\phi^o} \phi^o_t \times_{\phi^a} \phi^a_t\ \big(\mathbb{E}[\phi^o_t \otimes \phi^o_t \mid \phi^a_t;\ h^\infty_t]\, \phi^{a\top}_t + \lambda I\big)^{-1}$$
$$= p^\xi_t \times_{\phi^o} \phi^o_t \times_{\phi^a} \phi^a_t\ \big(p^o_t\, \phi^{a\top}_t + \lambda I\big)^{-1}$$
$$= W^\xi_{\text{ext}}\, q_t \times_{\phi^o} \phi^o_t \times_{\phi^a} \phi^a_t\ \big(W^o_{\text{ext}}\, q_t\, \phi^{a\top}_t + \lambda I\big)^{-1}$$
$$\equiv f_t(q_t, a_t, o_t)$$

This explicitly defines the full predictive state update equation, i.e., the filtering step of Eq. 2.76, when HSE-PSRs are used as the representation of $p_t$ and $q_t$.
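As a rough illustration of the two-stage idea, abstracting away kernels, actions, and the KBR conditioning step, the sketch below uses plain ridge regressions with history features as the instrument. All names are hypothetical, and the linear, uncontrolled setting is our simplifying assumption rather than the RFF-PSR construction of Hefny et al. (2017b).

import numpy as np

def two_stage_regression(hist_feats, fut_feats, ext_feats, reg=1e-3):
    """Minimal sketch of two-stage (instrumental) regression.

    hist_feats: N x H matrix of history features (the instruments)
    fut_feats:  N x F matrix of future features psi_t
    ext_feats:  N x E matrix of extended-future features xi_t
    Returns W_ext mapping denoised predictive states to extended states.
    """
    def ridge(X, Y, lam):
        # Solve min_W ||X W - Y||^2 + lam ||W||^2 in closed form.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

    # Stage 1: regress futures and extended futures on histories,
    # producing denoised estimates q_t and p_t at each time step.
    W1a = ridge(hist_feats, fut_feats, reg)   # histories -> predictive states
    W1b = ridge(hist_feats, ext_feats, reg)   # histories -> extended states
    Q = hist_feats @ W1a                      # denoised q_t, N x F
    P = hist_feats @ W1b                      # denoised p_t, N x E

    # Stage 2: linear regression from predictive to extended states (Eq. 2.75).
    W_ext = ridge(Q, P, reg)                  # F x E
    return W_ext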

2.6 Planning approaches

Planning is a general term with different connotations in different research communities. Common to all remains the central idea that planning consists of devising a plan for an agent to execute in order to complete a certain task. The planning taxonomy can be described in terms of: a plan, the strategy or behaviour, possibly a sequence of actions from a start state, that needs to be executed to achieve a given goal; a state, which describes the current position of the robot within its environment; time, which captures how far we are in the execution of the plan; and the feasibility and optimality of a given plan. Feasibility pertains to the existence of a solution, while optimality concerns improving the quality of the plan. In machine learning, planning is most commonly addressed with decision-theoretic methods that model uncertainty, adversarial scenarios, and optimization: once learning is achieved, which decisions should be made to reach a desired goal? Most commonly, this question is posed in the discrete setting, where we assume the agent chooses from a discrete set of actions and acts upon a discrete state space. Within Natural Language Processing, we may regard planning in this setting, where the agent (generally a state machine) may perform a given discrete set of decisions or actions defining a plan, i.e., the sequence of actions that leads to the desired goal. Such methods have been applied to a vast range of natural language tasks, such as parsing (Luque et al., 2012; Goldberg et al., 2012; Goodman et al., 2016), part-of-speech tagging (Chang et al., 2015), handwriting recognition (Ross et al., 2010), information extraction (Branavan et al., 2011; Narasimhan et al., 2016), machine translation (Grissom II et al., 2014; He et al., 2016), summarization (Paulus et al., 2017), and question answering (Choi et al., 2017). Within the field of Robotics, planning is concerned with finding algorithms that map high-level task specifications from humans into low-level motion controls/actions of a robot. Two concepts are most closely connected with robotic planning: motion planning and trajectory optimization. The former describes the rotations and translations required to move a body within a constrained space, usually ignoring body dynamics and other differential constraints. This problem is also known as the Piano Movers' Problem: knowing the map of a building, we are required to move a piano from a start position to a goal location without hitting any objects within the building. The second problem, trajectory optimization, usually refers to finding the motion of the robot according to a solution provided by a motion planner, while considering the mechanical limitations of the robot.

Trajectory optimization is considered complementary to motion planning, in that it usually requires an initial guess, in many cases provided by a motion planning algorithm. In motion planning, the state space defines the set of possible transformations that could be applied to the robot, its configuration space $\mathcal{C}$ (Lozano-Perez, 1981; Lozano-Pérez et al., 1979; Latombe, 1991). We define the workspace $\mathcal{W}$ as the space where the robots and the obstacles live. Motion planning algorithms can be decomposed into several classes: geometric approaches, combinatorial methods, sampling-based approaches, and optimization methods. Geometric problems require understanding and manipulating complex geometric representations that transform a single body or a chain of bodies. This requires defining geometric primitives that enable the robot to move in space. Most approaches require a description of configuration space obstacles, which can be prohibitive to create in high-dimensional planning problems. Combinatorial methods build paths in the continuous configuration space by constructing discrete representations without loss of information. These methods provide complete solutions, meaning that they will find a solution if one exists (Agarwal et al., 1997; BaezaYates et al., 1993; Arya et al., 1994). Sampling-based approaches have been widely used in complex problems (Aronov et al., 1998; Lavalle et al., 2000; Kavraki et al., 1994; Chiang et al., 2007) and in high-dimensional scenarios (Xanthidis et al., 2016). They consist mostly of search algorithms that sample the configuration space in order to find a feasible solution. Planning is defined as a continuous problem, applicable even when the number of geometric primitives is very large; these methods usually require a collision detection module to determine feasibility, and they generally provide weaker notions of completeness. Optimization-based planning offers an alternative family of motion planning algorithms. These methods directly search over the space of trajectories and employ optimization techniques to improve the quality of the final trajectory. The optimization occurs in the space of possible trajectories $\Xi$, where each trajectory $\xi: [0, 1] \to \mathcal{C}$ is a function of time $t \in [0, 1]$. Consequently, the calculus of variations is a crucial tool for characterizing the extrema of the problem. The quality of each trajectory is measured by a cost functional $\mathcal{U}: \Xi \to \mathbb{R}$ that we wish to minimize. In order to obtain trajectories that are collision-free and smooth, the objective functional needs to account for proximity to objects, in the form of $\mathcal{U}_{\text{obs}}$, as well as for dynamical quantities such as velocities and accelerations, represented as $\mathcal{U}_{\text{smooth}}$:

$$\mathcal{U} = \mathcal{U}_{\text{obs}} + \beta\, \mathcal{U}_{\text{smooth}} \quad (2.78)$$

The boundary conditions, such as the initial configuration $\xi(0)$ and the final configuration $\xi(1)$, can be treated as constraints in the optimization problem. Further dynamical or mechanical limits may be specified as an additional set of constraints that must be satisfied during optimization. This process can be regarded as incrementally searching for trajectories that follow the negative gradient of the cost functional in Eq. 2.78, while constrained to the tangent space of the constraints. Hence, many problems in motion planning via trajectory optimization may be cast as applications of Newton's method or gradient descent. The differential equations describing the motion of the agent, typically used in dynamic programming or the minimum principle, are difficult to solve analytically (Rimon et al., 1992); therefore, in most cases, numerical techniques are preferred (Quinlan et al., 1993; Brock, 2002). Gradient-based methods can provide efficient algorithms, since they leverage gradient information while avoiding the explicit computation of second-order dynamics. In the calculus of variations, gradients of the cost functional $\nabla\mathcal{U}$ correspond to perturbations $\phi: [0, 1] \to \mathcal{C}$ of the current trajectory $\xi$ that maximize $\mathcal{U}[\xi + \epsilon\phi]$ as $\epsilon \to 0$. Consider a cost functional of the form $\mathcal{U}[\xi] = \int v(\xi(t), \dot\xi(t))\, dt$ and the perturbed trajectory $\bar\xi = \xi + \epsilon\phi$. The extrema of the cost functional are characterized by the Euler-Lagrange equation:

$$\nabla\mathcal{U}[\xi] = \frac{\partial v}{\partial \xi} - \frac{d}{dt}\frac{\partial v}{\partial \dot\xi} = 0. \quad (2.79)$$

Therefore, we may optimize trajectories by simply taking iterative steps in the direction of the negative functional gradient, defined by

$$\xi_{t+1} = \xi_t - \lambda\, \nabla\mathcal{U}[\xi_t]. \quad (2.80)$$

Next, we briefly describe a computationally efficient algorithm that performs covariant (i.e., reparametrization-independent) updates of the functional cost, known as CHOMP (Covariant Hamiltonian Optimization for Motion Planning) (Zucker et al., 2013). CHOMP defines an obstacle cost functional in terms of a pre-computed distance field to obstacles in the workspace, $c: \mathcal{W} \to \mathbb{R}$, where $\mathcal{B}$ denotes the set of body points of the agent/robot and $x: \mathcal{C} \times \mathcal{B} \to \mathcal{W}$ its forward kinematics function, mapping a particular body point $u \in \mathcal{B}$ of the robot at a given configuration $\xi \in \mathcal{C}$ to a workspace location:

$$\mathcal{U}_{\text{obs}}[\xi] = \int_0^1 \int_{\mathcal{B}} c\big(x(\xi(t), u)\big)\, \Big\|\frac{d}{dt} x(\xi(t), u)\Big\|\, du\, dt \quad (2.81)$$
$$\mathcal{U}_{\text{smooth}}[\xi] = \frac{1}{2} \int_0^1 \Big\|\frac{d}{dt}\xi(t)\Big\|^2 dt \quad (2.82)$$

This algorithm performs gradient descent in trajectory space with respect to the natural norm of the functional, which amounts to preconditioning the Euclidean gradient by the inverse of the metric, $\nabla_A \mathcal{U}[\xi] = A^{-1}\, \nabla\mathcal{U}[\xi]$. This update makes the algorithm invariant to reparametrization, i.e., independent of the metric space $A$.
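As an illustration of the covariant update, the following sketch performs one CHOMP-style step on a discretized trajectory. The finite-difference metric and all names are our assumptions for the example; the full CHOMP algorithm of Zucker et al. (2013) additionally handles the obstacle cost integral and constraints.

import numpy as np

def finite_diff_metric(T):
    """Smoothness metric induced by first-order finite differences (sketch)."""
    K = np.eye(T) - np.eye(T, k=-1)   # discrete derivative operator
    return K.T @ K

def chomp_step(xi, grad_u, A, step=0.1):
    """One covariant gradient step on a discretized trajectory.

    xi:     T x D array of trajectory waypoints
    grad_u: T x D array, Euclidean gradient of the cost at each waypoint
    A:      T x T smoothness metric (e.g. from finite_diff_metric)
    """
    # Precondition the Euclidean gradient by A^{-1} (Eq. 2.80 with the natural norm),
    # which makes the step independent of trajectory reparametrization.
    return xi - step * np.linalg.solve(A, grad_u)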

2.7 Planning under uncertainty

An alternative branch of planning algorithms arises in decision-theoretic contexts, where uncertainty plays an important role. This gives rise to a family of algorithms known as reinforcement learning (RL), where good actions provide positive feedback in the form of a reward. This view describes planning as an optimization approach where, instead of minimizing a cost functional, we maximize a reward functional, also denoted the reward-to-go (by analogy with the cost-to-go). In reinforcement learning, an agent tries to maximize the accumulated reward over each trajectory. In an episodic setting, meaning the task is restarted at the end of each trajectory or episode, the objective is to maximize the total reward per episode; if the task is never-ending, a discount factor is introduced to weigh the importance of long- versus short-term rewards. In decision-theoretic planning or reinforcement learning, the agent or robot interacts with its environment by performing actions $a \in A$ in a continuous or discrete set. This process is characterized by a state $s \in S$, which can also have a continuous or discrete representation. The state contains all the information necessary to predict future states, for instance the current position and velocity of a swing-up pendulum. Actions are used to control or change the state of the system; in the same example, actions correspond to torques applied to the pendulum joints. For each interaction the agent receives a scalar-valued reward $r \in \mathbb{R}$, which can be a function of the state, or of the state and action; in the swing-up task it could be defined as the distance to the top position. The objective of the reinforcement learning agent is thus to find a mapping from states to actions, $\pi: S \to A$, denoted the policy, that maximizes the cumulative expected reward $J = \mathbb{E}_\pi[\sum_t \gamma^{t-1} r_t]$. This mapping can either be deterministic, yielding the exact same action for every realization of the same state, $a = \pi(s)$, or probabilistic, if it defines a distribution over actions, $a \sim \pi(s, a) = p(a \mid s)$. Classical reinforcement learning approaches are based on discrete decision-theoretic models such as Markov decision processes (MDPs), defined in terms of a set of rewards $R$, states $S$, actions $A$, and state transition probabilities $T(s', a, s) = p(s' \mid a, s)$ capturing the dynamics of the system. Transition probabilities account for uncertainty in the system dynamics even when the state is fully observable. MDPs are further characterized by the Markov property, which requires that the future state and reward depend only on the current state (Sutton et al., 1998).

That is, the state of the system is a sufficient statistic for future state predictions. The value function is a quantity that reflects the quality of the policy at a particular state, $V(s)$, or state/action pair, $Q(s, a)$. This quantity can be used as an objective in the optimization process. However, value function approaches struggle with many of the RL challenges found in Robotics, such as the high dimensionality of the state/action space. Value function algorithms are also very sensitive to approximation errors, leading to drastic changes in the policy, which could result in damage to the robot. For all of these reasons, value-based RL is often discouraged in Robotics applications. Policy search methods, on the other hand, directly optimize over the parameters $\theta$ of a policy $\pi(\theta)$, avoiding the need to specify a value function; this leads to smoother changes of policy and consequently safer behaviour for robots. This advances RL into high-dimensional continuous domains, with stability and robustness guarantees in some cases (Agarwal et al., 1999). Policy search methods generally require fewer parameters than value-based counterparts. Local search within the policy parameter landscape already provides good results (Kirk, 2012), and constraints can easily be incorporated, for instance by regularizing the change in path distribution. Overall, policy search provides a more scalable and robust variant of RL algorithms. Local optimization around an existing policy $\pi$, parametrized by a set of policy parameters $\theta$, proceeds by incrementally changing the parameters to improve the expected return:

$$\theta_{t+1} = \theta_t + \Delta\theta \quad (2.83)$$

The nature of the parameter update $\Delta\theta$ brings about a wide range of methods, from pairwise comparisons (Strens et al., 2001; Ng et al., 2004) and gradient estimation from finite differences (Geng et al., 2006; Kohl et al., 2004; Mitsunaga et al., 2006; Roberts et al., 2010; Sato et al., 2002; Tedrake et al., 2005), to stochastic optimization methods (Bagnell et al., 2001), cross-entropy methods (Rubinstein et al., 2004), differential dynamic programming (Atkeson, 1998), and shooting approaches (T. Betts, 2001). These methods can be categorized into three classes: policy gradient (PG) methods, expectation-maximization (EM) approaches, and information-theoretic updates (Kappen, 2005). Another useful distinction classifies RL approaches into model-based vs. model-free methods. The former learn a model of the state dynamics and use it to improve the policy, while the latter directly learn a policy from sampled trajectories of the environment. Model-free learning explores all three variants of policy parameter updates, while model-based learning focuses mainly on policy gradient methods. Next, we describe policy gradient methods, since they provide some of the most widely used approaches in Robotics.

2.7.1 Policy gradient methods

Policy gradient approaches update the policy parameters following the direction of the gradient of the expected return $J(\pi) = \frac{1}{T}\sum_t \mathbb{E}[\gamma^{t-1} r_t \mid \pi_\theta]$:

$$\theta_{t+1} = \theta_t + \alpha\, \nabla_\theta J(\pi), \quad (2.84)$$

where $\alpha$ is a stepsize that dictates how far we move along each gradient direction. Different forms of estimating the gradient bring about different techniques: some are based on likelihood-ratio estimation (Sutton et al., 1999), others arise from Expectation-Maximization approaches (Toussaint et al., 2011; Dayan et al., 1997; Vlassis et al., 2009; Kober et al., 2010), path integral methods (Kappen, 2005), or, more recently, natural gradient approaches (Peters et al., 2008; Wang et al., 2016; Schulman et al., 2015).

Reinforcement learning within Robotics differs from many other learning domains, since it often requires high-dimensional continuous state and action spaces. Contrary to the standard optimal control paradigm, in Robotics it is generally unrealistic to assume that the true state is completely observable and noise-free. Uncertainty plays a crucial role in modeling under these settings. It generally involves two main independent sources: predictability, associated with model errors, leading to uncertainty in predicting future states; and sensing, in which model noise and uncertainty from initial conditions, sensors, and the memory of previously applied actions can make different states look similar. For this reason, models in Robotics need to be robust to undermodeling and must deal with model uncertainty. Furthermore, real robot experience is costly and difficult to reproduce. It usually cannot be replaced by learning from simulations alone, since even small modeling errors can accumulate into substantially different outcomes, especially in highly dynamic tasks. Reward shaping poses another challenge in Robotics (Laud, 2004), in that the design of appropriate reward functions requires some domain knowledge. Above all, many complex tasks in Robotics leverage the information provided by models to counteract these difficulties (Atkeson et al., 1997; Abbeel et al., 2008; Deisenroth et al., 2011), and policy search approaches are typically preferred over value-based methods, since they provide more robust estimates of policies (Gullapalli et al., 1994; Miyamoto et al., 1996; Bagnell et al., 2001; Kohl et al., 2004; Tedrake et al., 2005; Peters et al., 2008; Kober et al., 2009).
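For concreteness, a minimal likelihood-ratio (REINFORCE-style) instance of the update in Eq. 2.84 might look as follows. The callback policy_grad_logp, returning the gradient of $\log \pi_\theta(a \mid s)$, is an assumed interface, and no variance-reduction baseline is included.

import numpy as np

def reinforce_update(theta, episodes, policy_grad_logp, alpha=0.01, gamma=0.99):
    """One REINFORCE-style policy gradient step (minimal sketch).

    episodes: list of trajectories, each a list of (state, action, reward)
    policy_grad_logp(theta, s, a): gradient of log pi_theta(a|s) w.r.t. theta
    """
    grad = np.zeros_like(theta)
    for traj in episodes:
        # Discounted return-to-go from each time step.
        G, returns = 0.0, []
        for (_, _, r) in reversed(traj):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # Likelihood-ratio estimate: sum_t grad log pi(a_t|s_t) * G_t.
        for (s, a, _), G_t in zip(traj, returns):
            grad += policy_grad_logp(theta, s, a) * G_t
    grad /= len(episodes)
    return theta + alpha * grad   # ascend the expected-return gradient (Eq. 2.84)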

In this section we have summarized the main literature on sequence prediction and planning using moment-based approaches and kernel algorithms. Next, we present the work developed in this thesis regarding prediction (Chapter 3), planning (Chapter 4), and a combination of both under a reinforcement learning setting (Chapter 5).

3 Sequence Labeling with Method of Moments

In this chapter, we introduce a sequence prediction algorithm based on moment-matching techniques that can be applied to discrete domains, such as sequence labeling problems. We assume the existence of certain observations that correlate exactly with each label. We provide an efficient algorithm that is able to perform label prediction on large datasets using little supervised information. We rely only on moment-based approaches to recover the state marginals and, subsequently, the most probable label for each observation in the sequence. The algorithm is nonetheless amenable to post-refinement using likelihood-based approaches or gradient descent, similarly to what is done in the following chapters. We apply the proposed algorithm to a part-of-speech (POS) tagging problem in Natural Language Processing. We start by describing two generative sequence models in Section 3.2: hidden Markov models (HMMs) (§3.2.1) and their generalization with emission features (§3.2.2). We then propose a weakly-supervised method for estimating these models' parameters (§3.3–§3.4), based only on observed statistics of words and contexts. Sections 3.5–3.6 provide practical advice and an empirical analysis of the moment-based approach on POS tagging problems for Twitter and for a low-resource language.

3.1 Motivation

Statistical learning of NLP models is often limited by the scarcity of annotated data. Weakly supervised methods have been proposed as an alternative to laborious manual annotation, combining large amounts of unlabeled data with limited resources, such as tag dictionaries or small annotated datasets (Merialdo, 1994; Smith et al., 2005; Garrette et al., 2013). Unfortunately, most semi-supervised learning algorithms for the structured problems found in NLP are computationally expensive, requiring multiple decoding passes through the unlabeled data, or expensive similarity graphs. More scalable learning algorithms are

necessary if we wish to take advantage of very large corpora. Certain semi-supervised learning methods, like label propagation, do not require decoding the unlabeled data, but they often require building a huge, expensive graph. In this chapter, we propose a moment-matching method for semi-supervised learning of sequence models, as introduced in §2.3.1. Spectral learning and moment-matching approaches have recently proved a viable alternative to expectation-maximization (EM) for unsupervised learning (Hsu et al., 2012b; Balle et al., 2012c; Bailly et al., 2013b), supervised learning with latent variables (Cohen et al., 2014; Quattoni et al., 2014; Stratos et al., 2013), and topic modeling (Arora et al., 2013; Nguyen et al., 2015). These methods have learnability guarantees, do not suffer from local optima, and are computationally less demanding. Current moment-matching approaches are, however, less flexible: for example, it is not obvious how to adapt existing methods to semi-supervised learning, where the latent variables do not represent arbitrary states but rather denote labels, observed for some data points and unobserved for others. One drawback of spectral methods is that they do not provide an explicit representation for states; although there has been some work attempting to recover this transformation (Hsu et al., 2012b), it does not provide good results in practice. Observations can be predicted as products of operators, without explicitly identifying the hidden states, by marginalizing over all possible states; operators have been used in this way to make predictions over observations without an explicit hidden state representation (Boots et al., 2011a), and also as decorations in a structured problem, enriching labels with more refined information (Stratos et al., 2013; Cohen et al., 2012). Different approaches based on optimization techniques have been proposed, providing a more generic perspective of the problem and allowing integration with other parameter constraints (Balle et al., 2012a; Chaganty et al., 2014a). Existing spectral learning approaches estimate operators invariant to similarity transforms, as in §2.5 (Jaeger, 2000). In order for states to represent well-defined labels, it would be necessary to break this invariance, which would spoil the current estimation procedures based on matrix and tensor decomposition (§2.4.3). Next, we describe an alternative method that enables the explicit recovery of model parameters. This chapter does not make use of predictive state representations; instead, it uncovers the relation between observations and states by “anchoring” specific observations to hidden states (§2.4.4). This approach learns the model parameters directly. Although it does not have the same expressive power as a PSR, it finds parameters that match moments of observations while allowing to add labeled information easily.

[Figure 3.1: HMM for the example tweet “hehehe its gonna b a good day”: the context (green) is conditionally independent of the present word (red) given the state $h_\ell$.]

This is mostly relevant if we want to achieve good performance compared with other semi-supervised methods such as EM. Unlike spectral methods, ours does not require an orthogonal decomposition of any matrix or tensor. Instead, it considers a more restricted form of supervision: words that have unambiguous annotations, so-called anchor words (Arora et al., 2013). Rather than identifying anchor words from unlabeled data (Stratos et al., 2016), we extract them from a small labeled dataset or from a dictionary. Given the anchor words, the estimation of the model parameters can be made efficient by collecting moment statistics from unlabeled data and then solving a small quadratic program for each word. Our contributions are as follows:

• We adapt anchor methods to semi-supervised learning of generative sequence models.
• We show how our method can also handle log-linear feature-based emissions.
• We apply this model to POS tagging. Our experiments on the Twitter dataset introduced by Gimpel et al. (2011) and on the dataset introduced by Garrette et al. (2013) for Malagasy, a low-resource language, show that our method does particularly well with very little labeled data, outperforming semi-supervised EM and self-training.

3.2 Sequence Labeling

In this chapter, we address the problem of sequence labeling. Let $x_{1:L} = \langle x_1, \ldots, x_L \rangle$ be a sequence of $L$ input observations (for example, words in a sentence). The goal is to predict a sequence of labels $h_{1:L} = \langle h_1, \ldots, h_L \rangle$, where each $h_i$ is a label for the observation $x_i$ (for example, the word's POS tag).

3.2.1 Hidden Markov Models

We define random variables $X := \langle X_1, \ldots, X_L \rangle$ and $H := \langle H_1, \ldots, H_L \rangle$, corresponding to observations and labels, respectively. Each $X_i$ is a random variable over a set $\mathcal{X}$ (the vocabulary), and each $H_i$ ranges over $\mathcal{H}$ (a finite set of “states” or “labels”). We denote the vocabulary size by $V = |\mathcal{X}|$ and the number of labels by $K = |\mathcal{H}|$. A first-order HMM has the following generative scheme:

$$p(X = x_{1:L}, H = h_{1:L}) := \prod_{\ell=1}^{L} p(X_\ell = x_\ell \mid H_\ell = h_\ell)\ \prod_{\ell=0}^{L} p(H_{\ell+1} = h_{\ell+1} \mid H_\ell = h_\ell), \quad (3.1)$$

where we define $h_0 = \text{start}$ and $h_{L+1} = \text{stop}$. We adopt the following notation for the parameters of the HMM of Section 2.3.1: the emission matrix $O \in \mathbb{R}^{V \times K}$, defined as $O_{x,h} := p(X_\ell = x \mid H_\ell = h)$ for all $h \in \mathcal{H}$, $x \in \mathcal{X}$, and the transition matrix $T \in \mathbb{R}^{(K+2) \times (K+2)}$, defined as $T_{h,h'} := p(H_{\ell+1} = h \mid H_\ell = h')$ for every $h, h' \in \mathcal{H} \cup \{\text{start}, \text{stop}\}$.¹ This matrix satisfies $T^\top \mathbf{1} = \mathbf{1}$.² (¹ Where $T_{\text{start},\text{stop}} = 1$ and $T_{\text{start},h'} = 0$ for $h' \neq \text{stop}$. ² That is, it satisfies $\sum_{h=1}^{K} p(H_{\ell+1} = h \mid H_\ell = h') + p(H_{\ell+1} = \text{stop} \mid H_\ell = h') = 1$, and also $\sum_{h=1}^{K} p(H_1 = h \mid H_0 = \text{start}) = 1$.) Throughout the rest of this chapter we adopt $X \equiv X_\ell$ and $H \equiv H_\ell$ to simplify notation whenever the index $\ell$ is clear from the context. Under this generative process, predicting the most probable label sequence $h_{1:L}$ given observations $x_{1:L}$ is accomplished with the Viterbi algorithm in $O(LK^2)$ time. If labeled data are available, the model parameters $O$ and $T$ can be estimated with the maximum likelihood principle, which boils down to simple counting of events and normalization. If we only have unlabeled data, the traditional approach is the expectation-maximization (EM) algorithm (§2.4.1), which alternately decodes the unlabeled examples and updates the model parameters, requiring multiple passes over the data. The same algorithm can be used for semi-supervised learning when labeled and unlabeled data are combined, by initializing the model parameters with the supervised estimates and interpolating the estimates in the M-step.
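The Viterbi step mentioned above is a standard dynamic program; the sketch below works in log-space and, for brevity, folds the start distribution into a vector and omits the stop transition, a simplification of the conventions in Eq. 3.1.

import numpy as np

def viterbi(obs, log_T, log_O, log_pi):
    """Most probable label sequence for an HMM in O(L K^2) time.

    obs:    length-L list of word indices x_1..x_L
    log_T:  K x K matrix, log p(H_{l+1}=h | H_l=h'), indexed [h, h']
            (matching the T_{h,h'} convention of the text)
    log_O:  V x K matrix, log p(X=x | H=h)
    log_pi: K vector, log p(H_1=h | start)
    """
    L, K = len(obs), log_pi.shape[0]
    delta = log_pi + log_O[obs[0]]          # best log-score ending in each state
    back = np.zeros((L, K), dtype=int)      # backpointers
    for l in range(1, L):
        scores = delta[None, :] + log_T     # scores[h, h'] = delta[h'] + log_T[h, h']
        back[l] = np.argmax(scores, axis=1)
        delta = scores[np.arange(K), back[l]] + log_O[obs[l]]
    h = np.zeros(L, dtype=int)
    h[-1] = int(np.argmax(delta))
    for l in range(L - 1, 0, -1):           # backtrace the best path
        h[l - 1] = back[l, h[l]]
    return h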

3.2.2 Feature-Based Hidden Markov Models

Sequence models with log-linear emissions have been considered by Smith et al. (2005), in a discriminative setting, and by Berg-Kirkpatrick et al. (2010), as generative models for POS induction. Feature-based HMMs (FHMMs) define a feature function for words, $\phi(X) \in \mathbb{R}^W$, which can be discrete or continuous. This allows us, for example, to indicate whether an observation, corresponding to a word, starts with an uppercase letter, contains digits, or has specific affixes. More generally, it helps with the treatment of out-of-vocabulary words. The emission probabilities are modeled as $K$ conditional distributions parametrized by a log-linear model, where $\theta_h \in \mathbb{R}^W$ are feature weights:

$$p(X = x \mid H = h) := \exp(\theta_h^\top \phi(x)) / Z(\theta_h). \quad (3.2)$$

Algorithm 3: Semi-Supervised Learning of HMMs with Moments
Input: Labeled dataset $\mathcal{D}_L$, unlabeled dataset $\mathcal{D}_U$
Output: Estimates of emissions $O$ and transitions $T$
1: Estimate context-word moments $\hat{Q}$ from $\mathcal{D}_U$ (Eq. 3.5)
2: for each label $h \in \mathcal{H}$ do
3:   Extract the set of anchor words $\mathcal{A}(h)$ from $\mathcal{D}_L$ (§3.3.2)
4:   Estimate context-label moments $\hat{R}$ from the anchors and $\mathcal{D}_U$ (Eq. 3.12)
5: for each word $w \in [V]$ do
6:   Solve the QP in Eq. 3.14 to obtain $\gamma_w$ from $\hat{Q}, \hat{R}$
7: Estimate emissions $O$ from $\Gamma$ via Eq. 3.15
8: Estimate transitions $T$ (§3.3.5)
9: Return: emission and transition matrices $(O, T)$

Above, $Z(\theta_h) := \sum_{x' \in \mathcal{X}} \exp(\theta_h^\top \phi(x'))$ is a normalization factor. We will show in §3.4 how our moment-based semi-supervised method can also be used to learn the feature weights $\theta_h$.

3.3 Semi-Supervised Learning via Moments

We now describe our moment-based semi-supervised learning method for HMMs. Throughout, we assume the availability of a small labeled dataset $\mathcal{D}_L$ and a large unlabeled dataset $\mathcal{D}_U$. The full roadmap of our method is shown as Algorithm 3. Key to our method is the decomposition of a context-word moment matrix $Q \in \mathbb{R}^{V \times C}$, which counts co-occurrences of words and contexts and will be formally defined in §3.3.1. Such co-occurrence matrices are often collected in NLP for various problems, ranging from dimensionality reduction of documents using latent semantic indexing (Deerwester et al., 1990b; Landauer et al., 1998) to distributional semantics (Schütze, 1998; Levy et al., 2015) and word embedding generation (Dhillon et al., 2015b; Osborne et al., 2016). We can build such a moment matrix entirely from the unlabeled data $\mathcal{D}_U$. The same unlabeled data is used to build an estimate of a context-label moment matrix $R \in \mathbb{R}^{K \times C}$, as explained in §3.3.3. This is done by first identifying words that are unambiguously associated with each label $h$, called anchor words, with the aid of a small amount of labeled data; this is outlined in §3.3.2. Finally, given empirical estimates of $Q$ and $R$, we estimate the emission matrix $O$ by solving a small optimization problem independently per word (§3.3.4). The transition matrix $T$ is obtained directly from the labeled dataset $\mathcal{D}_L$ by maximizing the likelihood.

3.3.1 Moments of Contexts and Words

To formalize the notion of “context,” we introduce the shorthand $Z_\ell := \langle X_{1:(\ell-1)}, X_{(\ell+1):L} \rangle$. Importantly, the HMM in Eq. 3.1 entails the following conditional independence assumption: $X_\ell$ is conditionally independent of the surrounding context $Z_\ell$ given the hidden state $H_\ell$. This is illustrated in Figure 3.1, using POS tagging as an example task. We introduce a vector of context features $\psi(Z_\ell) \in \mathbb{R}^C$, which may look arbitrarily within the context $Z_\ell$ (left or right), but not at $X_\ell$ itself. These features could be “one-hot” representations or other reduced-dimensionality embeddings (as described later in §3.5). Consider a word $w \in \mathcal{X}$, an instance of $X \equiv X_\ell$. A pivotal matrix in our formulation is the matrix $Q \in \mathbb{R}^{V \times C}$, defined as:

$$Q_{w,c} := \mathbb{E}[\psi_c(Z) \mid X = w]. \quad (3.3)$$

Expectations here are taken with respect to the probabilistic model in Eq. 3.1 that generates the data. The following marginal probabilities will also be necessary:

$$q_c := \mathbb{E}[\psi_c(Z)], \qquad p_w := p(X = w). \quad (3.4)$$

Since all the variables in Eqs. 3.3–3.4 are observed, we can easily obtain empirical estimates by taking expectations over the unlabeled data:

$$\hat{Q}_{w,c} = \frac{\sum_{x,z \in \mathcal{D}_U} \psi_c(z)\, \mathbb{1}(x = w)}{\sum_{x,z \in \mathcal{D}_U} \mathbb{1}(x = w)}, \quad (3.5)$$

$$\hat{q}_c = \frac{\sum_{x,z \in \mathcal{D}_U} \psi_c(z)}{|\mathcal{D}_U|}, \quad (3.6)$$

$$\hat{p}_w = \frac{\sum_{x,z \in \mathcal{D}_U} \mathbb{1}(x = w)}{|\mathcal{D}_U|}, \quad (3.7)$$

where $\mathbb{1}(x = w)$ denotes the indicator for word $w$. Note that, under our modeling assumptions, $Q$ decomposes in terms of the hidden states:

$$\mathbb{E}[\psi_c(Z) \mid X = w] = \sum_{h \in \mathcal{H}} p(H = h \mid X = w)\, \mathbb{E}[\psi_c(Z) \mid H = h]. \quad (3.8)$$

The reason why this holds is that, as stated above, $Z$ and $X$ are conditionally independent given $H$.
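Collecting the empirical moments of Eqs. 3.5–3.7 is a single counting pass over the unlabeled corpus. A minimal sketch, assuming a caller-supplied context featurizer psi (an illustrative interface, not code from the thesis):

import numpy as np

def context_word_moments(corpus, vocab, psi, C):
    """Empirical moments Q-hat (Eq. 3.5), q-hat (Eq. 3.6), p-hat (Eq. 3.7).

    corpus: list of sentences, each a list of word indices into vocab
    psi(sentence, l): C-dim context feature vector for position l
    """
    V = len(vocab)
    Q_num = np.zeros((V, C))   # running sum of context features per word
    counts = np.zeros(V)       # occurrences of each word
    q_sum, n = np.zeros(C), 0
    for sent in corpus:
        for l, w in enumerate(sent):
            feats = psi(sent, l)
            Q_num[w] += feats
            counts[w] += 1
            q_sum += feats
            n += 1
    Q = Q_num / np.maximum(counts, 1)[:, None]   # E[psi(Z) | X = w]
    return Q, q_sum / n, counts / n              # Q-hat, q-hat, p-hat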

3.3.2 Anchor Words

Following Arora et al. (2013) and Cohen et al. (2014), we identify anchor words, whose hidden state is assumed to be deterministic regardless of context. In this thesis, we generalize this notion to more than one anchor word per label, for improved context estimates. This allows for more flexible forms of anchors with weak supervision. For each state $h \in \mathcal{H}$, let its set of anchor words be

$$\mathcal{A}(h) = \{ w \in \mathcal{X} : p(H = h \mid X = w) = 1 \} \quad (3.9)$$
$$\phantom{\mathcal{A}(h)} = \{ w \in \mathcal{X} : O_{w,h} > 0 \,\wedge\, O_{w,h'} = 0,\ \forall h' \neq h \}.$$

That is, $\mathcal{A}(h)$ is the set of unambiguous words that always take the label $h$. It can be estimated from the labeled dataset $\mathcal{D}_L$ by collecting the most frequent unambiguous words for each label. Algorithms for identifying $\mathcal{A}(h)$ from unlabeled data alone were proposed by Arora et al. (2013) and Zhou et al. (2014), with application to topic models. Our work differs in that we do not aim to discover anchor words from purely unlabeled data, but rather exploit the fact that small amounts of labeled data are commonly available in many NLP tasks; better anchors can be extracted easily from such small labeled datasets. There is, however, a trade-off between the amount of labeled information and the quality of the anchors. In the proposed approach we assume the anchors are pure, unambiguous observations; in practice, anchors may be imperfectly identified, for example if the labeled dataset is too small. We verified that it is enough to select frequent but impure anchors for each label. In §3.5 we give a more detailed description of the selection process.

3.3.3 Moments of Contexts and Labels

We define the matrix $R \in \mathbb{R}^{K \times C}$ as follows:

$$R_{h,c} := \mathbb{E}[\psi_c(Z) \mid H = h]. \quad (3.10)$$

Since the expectation in Eq. 3.10 is conditioned on the (unobserved) label $h$, we cannot directly estimate it using moments of observed variables, as we do for $Q$. However, if we have identified sets of anchor words for each label $h \in \mathcal{H}$, we have:

$$\mathbb{E}[\psi_c(Z) \mid X \in \mathcal{A}(h)] = \sum_{h'} \mathbb{E}[\psi_c(Z) \mid H = h']\, \underbrace{p(H = h' \mid X \in \mathcal{A}(h))}_{=\,\mathbb{1}(h' = h)} = R_{h,c}. \quad (3.11)$$

Therefore, given the set of anchor words $\mathcal{A}(h)$, the $h$-th row of $R$ can be estimated in a single pass over the unlabeled data, as follows:

$$\hat{R}_{h,c} = \frac{\sum_{x,z \in \mathcal{D}_U} \psi_c(z)\, \mathbb{1}(x \in \mathcal{A}(h))}{\sum_{x,z \in \mathcal{D}_U} \mathbb{1}(x \in \mathcal{A}(h))}. \quad (3.12)$$
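The single pass of Eq. 3.12 can be sketched analogously, given anchor sets $\mathcal{A}(h)$ represented as sets of word indices; as before, psi and the function name are illustrative assumptions:

import numpy as np

def context_label_moments(corpus, anchors, psi, C, K):
    """Empirical context-label moments R-hat (Eq. 3.12).

    anchors: dict mapping label h (0..K-1) to its set of anchor word indices A(h)
    """
    # Anchors are unambiguous, so each anchor word maps to exactly one label.
    word2label = {w: h for h, A in anchors.items() for w in A}
    R_num, R_den = np.zeros((K, C)), np.zeros(K)
    for sent in corpus:
        for l, w in enumerate(sent):
            h = word2label.get(w)
            if h is not None:              # anchor occurrence for label h
                R_num[h] += psi(sent, l)
                R_den[h] += 1
    return R_num / np.maximum(R_den, 1)[:, None]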

3.3.4 Emission Distributions

We can now put all the ingredients above together to estimate the emission probability matrix $O$. The procedure we propose here is computationally very efficient, since only one pass over the unlabeled data is required to collect the co-occurrence statistics $\hat{Q}$ and $\hat{R}$. The emissions are then estimated from these moments by solving a small problem independently for each word. Unlike EM and self-training, no decoding is necessary, only counting and normalizing; and unlike label propagation methods, there is no requirement to build a graph over the unlabeled data. The crux of our method is the decomposition in Eq. 3.8, combined with the one-to-one correspondence between labels $h$ and anchor words $\mathcal{A}(h)$ (the “anchor trick” illustrated in Figure 3.2, which solves the QP from observable moments by replacing states with anchor surrogates). We can rewrite Eq. 3.8 as:

$$Q_{w,c} = \sum_h p(H = h \mid X = w)\, R_{h,c}. \quad (3.13)$$

In matrix notation, we have $Q = \Gamma R$, where $\Gamma \in \mathbb{R}^{V \times K}$ is defined as $\Gamma_{w,h} := p(H = h \mid X = w)$. If we had infinite unlabeled data, our moment estimates $\hat{Q}$ and $\hat{R}$ would be perfect and we could solve the system of equations in Eq. 3.13 to obtain $\Gamma$ exactly. Since we have finite data, we resort to a least squares solution. This corresponds to solving a simple quadratic program (QP) per word, independent from all the other words, as follows. Denote by $q_w := \mathbb{E}[\psi(Z) \mid X = w] \in \mathbb{R}^C$ and by $\gamma_w := p(H = \cdot \mid X = w) \in \mathbb{R}^K$ the $w$-th rows of $Q$ and $\Gamma$, respectively. We estimate the latter distribution following Arora et al. (2013):

$$\hat{\gamma}_w = \arg\min_{\gamma_w} \|\hat{q}_w - \gamma_w^\top \hat{R}\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top \gamma_w = 1,\ \gamma_w \geq 0. \quad (3.14)$$

Note that this QP is very small (it has only $K$ variables), even though the vocabulary size is typically on the order of $10^4$ to $10^5$ word types; it can be solved efficiently (1.7 ms on average per word $w$, in Gurobi, with $K = 12$). Given the probability tables for $p(H = h \mid X = w)$, we can estimate the emission probabilities $O$ by direct application of Bayes' rule:

$$\hat{O}_{w,h} = \frac{p(H = h \mid X = w) \times p(X = w)}{p(H = h)} \quad (3.15)$$
$$\phantom{\hat{O}_{w,h}} = \frac{\hat{\gamma}_{w,h} \times \hat{p}_w}{\sum_{w'} \hat{\gamma}_{w',h} \times \hat{p}_{w'}}. \quad (3.16)$$

These parameters are guaranteed to lie in the probability simplex, avoiding the need for the heuristics used in prior work on spectral learning to deal with “negative” and “unnormalized” probabilities (Cohen et al., 2013c).
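The per-word QP of Eq. 3.14 is small enough for an off-the-shelf solver (the experiments use Gurobi). As a self-contained alternative, the sketch below solves it by projected gradient descent onto the probability simplex:

import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def anchor_qp(q_w, R, n_iters=500):
    """Solve Eq. 3.14 by projected gradient: min ||q_w - R^T g||^2, g in simplex.

    q_w: C-dim context moment vector for word w (a row of Q-hat)
    R:   K x C context-label moment matrix (R-hat)
    """
    K = R.shape[0]
    g = np.full(K, 1.0 / K)                             # start at the uniform distribution
    step = 1.0 / (np.linalg.norm(R, 2) ** 2 + 1e-12)    # safe step from the Lipschitz bound
    for _ in range(n_iters):
        grad = R @ (R.T @ g - q_w)                      # gradient of the least-squares objective
        g = project_simplex(g - step * grad)
    return g                                            # estimate of p(H = . | X = w)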

3.3.5 Transition Distributions

It remains to estimate the transition matrix $T$. We propose two methods. The first uses relative frequencies directly estimated from the annotated data, ignoring the unannotated data. For the problems tackled in this chapter, the number of labels $K$ is small compared to the vocabulary size $V$. The transition matrix has only $O(K^2)$ degrees of freedom, and we found it effective to estimate it using the labeled sequences in $\mathcal{D}_L$ alone, without any refinement. This was done by smoothed maximum likelihood estimation on the labeled data, which boils down to counting occurrences of consecutive labels, applying add-one smoothing to avoid zero probabilities for unobserved transitions, and normalizing. Let $M_{h',h} = p(H_\ell = h', H_{\ell-1} = h)$, for all $h, h' \in \mathcal{H}$, so that $M \in \mathbb{R}^{(K+1) \times (K+1)}$ contains an extra row and column for the start/stop probabilities, respectively. The conditional transition probability matrix $T \in \mathbb{R}^{K \times K}$ can be directly derived from $M$ through proper normalization (first method):

$$T_{h,h'} = p(H_\ell = h \mid H_{\ell-1} = h') = \frac{M_{h,h'}}{\sum_{h''=1}^{K} M_{h'',h'}}$$

The start and stop probabilities can also be read off the appropriate row and column ($\pi_h = M_{h,K+1}$ and $f_{*h} = M_{K+1,h}$). For problems with numerous labels, a second alternative is the composite likelihood method (Chaganty et al., 2014b).³ Given an emission matrix $\hat{O}$, maximizing the composite log-likelihood leads to a convex optimization problem that can be efficiently solved with an EM algorithm. A similar procedure was carried out by Cohen et al. (2014). This approach maximizes a pairwise composite likelihood over pairs of words, $\mathcal{L}_{CL}(X_{\ell-1}, X_\ell) = p(X_{\ell-1}, X_\ell)$, in Eq. 3.17 (Liang et al., 2003; Lindsay, 1988). Composite likelihood, also known as pseudo-likelihood (Molenberghs et al., 2006), provides a popular alternative in cases where the full likelihood is unavailable or difficult to model. In this model, we use pairwise composite likelihoods to account for the transition parameters. Composite likelihoods may be regarded as misspecified likelihoods derived from the model's independence assumptions.⁴ Under the model, the pairwise word probability decomposes into $p(X_{\ell-1} = x', X_\ell = x) = \sum_{h,h'} M_{h,h'}\, O_{x',h'}\, O_{x,h}$:

$$\arg\max_M \sum_\ell \log \mathcal{L}_{CL}(X_{\ell-1}, X_\ell) \quad \text{s.t.} \quad \mathbf{1}^\top M \mathbf{1} = 1,\ M \geq 0. \quad (3.17)$$

(³ In the experimental section, the composite likelihood method was not competitive with estimating the transition matrices directly from the labeled data, on the datasets used in our experiments. However, it may be a viable alternative if there is no labeled data and the anchors are extracted from gazetteers or a dictionary. ⁴ The misspecification refers to the conditional independence assumption that observations are conditionally independent given the state and that each state only depends on the previous state, giving rise to a factorization into pairwise terms.)

Even if the model is misspecified, the maximum composite likelihood parameter $\theta = \langle M, O \rangle$ is consistent and asymptotically normally distributed, with variance equal to the inverse of the Godambe information

$G(\theta) = H(\theta)\, J(\theta)^{-1} H(\theta)$ (Godambe, 1960), where $H(\theta) = -\mathbb{E}[\nabla^2_\theta \log \mathcal{L}_{CL}]$ is the sensitivity matrix and $J(\theta) = \mathrm{var}_\theta[\nabla_\theta \log \mathcal{L}_{CL}]$ its variability matrix. When the composite likelihood is the full likelihood function, then $H = J$ and the Godambe information coincides with the Fisher information matrix $I(\theta) = \mathrm{var}_\theta(\nabla_\theta \log p(X))$ of the full log-likelihood, and the maximum composite likelihood estimator is said to be perfectly efficient, $G = H = I$; however, this is not true in the general case (Varin et al., 2005; Varin et al., 2011). Here $\ell$ ranges over the positions in the data where a word (or the stop symbol $*$) is preceded by another word. This process can be performed over large amounts of unlabeled data. The optimization in Eq. 3.17 is convex in $M$; Algorithm 4 presents an alternating minimization approach (EM) that finds a global optimum in just a few iterations (typically 5–10) (Cohen et al., 2014; Chaganty et al., 2014b).

Algorithm 4: Get Transitions
Input: $\hat{O}$, $\tilde{p}(X_\ell, X_{\ell-1})$, $M_0$
Run for each iteration:
1: for $i, j = 1, \ldots, V$ do
   E-Step: compute state posteriors
2:   $p(h, h', x_i, x_j) = M_{h,h'}\, O_{x_i,h'}\, O_{x_j,h}$
3:   $p(h, h' \mid x_i, x_j) = p(h, h', x_i, x_j) \,/\, \sum_{h,h'} p(h, h', x_i, x_j)$
   Update the composite likelihood $\mathcal{L}_{cl}$ iteratively for pairs of words:
4:   $\mathcal{L}_{cl} \mathrel{+}= \tilde{p}(x_i, x_j) \log \big[ \sum_{h,h'} p(h, h', x_i, x_j) \big]$
   M-Step: update parameters
5:   $M_{h,h'} = \tilde{p}(x_i, x_j)\, p(h, h' \mid x_i, x_j)$
6: Normalize so that $\mathbf{1}^\top M \mathbf{1} = 1$.
7: Read the initial $\pi$, final $f_*$, and transition probabilities $T$ from $M$.
8: Return: $\langle T, \pi, f_* \rangle$.

3.4 Feature-Based Emissions

Next, we extend our method to estimate the parameters of the FHMM of §3.2.2. Besides the contextual features $\psi(Z) \in \mathbb{R}^C$, we now also assume a feature encoding function for words, $\phi(X) \in \mathbb{R}^W$. Our framework, illustrated in Algorithm 5, allows for both discrete and continuous word and context features. Lines 2–4 are the same as in Algorithm 3, replacing word occurrences with expected values of word features (we redefine $Q$ and $\Gamma$ to cope with features instead of words). The main difference with respect to Algorithm 3 is that we do not estimate emission probabilities directly; rather, we first estimate the mean parameters (feature expectations $\mathbb{E}[\phi(X) \mid H = h]$), by solving one QP for each emission feature, and then we solve a convex optimization problem for each label $h$, to recover the log-linear weights over the emission features (the canonical parameters).

Algorithm 5: Semi-Supervised Learning of Feature-Based HMMs with Moments
Input: Labeled dataset $\mathcal{D}_L$, unlabeled dataset $\mathcal{D}_U$
Output: Emission log-linear parameters $\Theta$ and transitions $T$
1: Estimate context-word moments $\hat{Q}$ from $\mathcal{D}_U$ (Eq. 3.21)
2: for each label $h \in \mathcal{H}$ do
3:   Extract the set of anchor words $\mathcal{A}(h)$ from $\mathcal{D}_L$ (§3.3.2)
4:   Estimate context-label moments $\hat{R}$ from the anchors and $\mathcal{D}_U$ (Eq. 3.12)
5: for each word feature $j \in [W]$ do
6:   Solve the QP in Eq. 3.23 to obtain $\gamma_j$ from $\hat{Q}, \hat{R}$
7: for each label $h \in \mathcal{H}$ do
8:   Estimate the mean parameters $\mu_h$ from $\Gamma$ (Eq. 3.25)
9:   Estimate the canonical parameters $\theta_h$ from $\mu_h$ with Algorithm 6
10: Estimate transitions $T$ from $\mathcal{D}_L$
11: Return: canonical parameters and transition matrix $(\Theta, T)$


3.4.1 Estimation of Mean Parameters

First of all, we replace word probabilities by expectations over word features. We redefine the matrix $\Gamma \in \mathbb{R}^{W \times K}$ as follows:

$$\Gamma_{j,h} := \frac{p(H = h) \times \mathbb{E}[\phi_j(X) \mid H = h]}{\mathbb{E}[\phi_j(X)]}. \quad (3.18)$$

Note that, with one-hot word features, we have $\mathbb{E}[\phi_w(X) \mid H = h] = p(X = w \mid H = h)$ and $\mathbb{E}[\phi_w(X)] = p(X = w)$, and therefore $\Gamma_{w,h} = p(H = h \mid X = w)$, so this can be regarded as a generalization of the framework in §3.3.4. Second, we redefine the context-word moment matrix $Q$ as the following matrix in $\mathbb{R}^{W \times C}$:

$$Q_{j,c} = \mathbb{E}\big[\psi_c(Z) \times \phi_j(X)\big] \,/\, \mathbb{E}[\phi_j(X)]. \quad (3.19)$$

Again, note that we recover the previous $Q$ if we use one-hot word features. We then have the following generalization of Eq. 3.13:

$$\mathbb{E}\big[\psi_c(Z) \times \phi_j(X)\big] \,/\, \mathbb{E}[\phi_j(X)] = \sum_h \frac{p(H = h)\, \mathbb{E}[\phi_j(X) \mid H = h]}{\mathbb{E}[\phi_j(X)]}\ \mathbb{E}[\psi_c(Z) \mid H = h], \quad (3.20)$$

or, in matrix notation, $Q = \Gamma R$. Again, the matrices $Q$ and $R$ can be estimated from data by collecting empirical feature expectations over unlabeled sequences of observations. For $R$, use Eq. 3.12 with no change; for $Q$, replace Eq. 3.5 by

$$\hat{Q}_{j,c} = \sum_{x,z \in \mathcal{D}_U} \psi_c(z)\, \phi_j(x) \Big/ \sum_{x \in \mathcal{D}_U} \phi_j(x). \quad (3.21)$$

Let $q_j \in \mathbb{R}^C$ and $\gamma_j \in \mathbb{R}^K$ be rows of $\hat{Q}$ and $\Gamma$, respectively. Note that we must have

$$\mathbf{1}^\top \gamma_j = \sum_h \frac{p(H = h)\, \mathbb{E}[\phi_j(X) \mid H = h]}{\mathbb{E}[\phi_j(X)]} = 1, \quad (3.22)$$

since $\mathbb{E}[\phi_j(X)] = \sum_h p(H = h)\, \mathbb{E}[\phi_j(X) \mid H = h]$. We rewrite the QP to minimize the squared difference for each dimension $j$ independently:

$$\hat{\gamma}_j = \arg\min_{\gamma_j} \|\hat{q}_j - \gamma_j^\top \hat{R}\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top \gamma_j = 1. \quad (3.23)$$

Note that, if $\phi(x) \geq 0$ for all $x \in \mathcal{X}$, then we must have $\gamma_j \geq 0$, and therefore we may impose this inequality as an additional constraint.⁵ (⁵ If we ensure strictly positive constraints, then $\mathbb{E}[\phi_j]$ will always be non-zero. When this constraint is not specified and $\hat{q}_{j,i} = 0$, the equality in Eq. 3.22 may be regarded as $p(H = h_i)\, \mathbb{E}[\phi_j(X) \mid H = h_i] = \mathbb{E}[\phi_j(X)]$, where $\mathbb{E}[\phi_j(X)] = 0$ and so $\mathbb{E}[\phi_j(X) \mid h_i] = 0$.) Let $\bar{\gamma} \in \mathbb{R}^K$ be the vector of state probabilities, with entries $\bar{\gamma}_h := p(H = h)$ for $h \in \mathcal{H}$. This vector can also be recovered from the unlabeled dataset and the set of anchors, by solving another QP that aggregates information over all words:

$$\hat{\bar{\gamma}} = \arg\min_{\bar{\gamma}} \|\hat{\bar{q}} - \bar{\gamma}^\top \hat{R}\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^\top \bar{\gamma} = 1,\ \bar{\gamma} \geq 0, \quad (3.24)$$

where $\bar{q} := \hat{\mathbb{E}}[\psi(Z)] \in \mathbb{R}^C$ is the vector whose entries are defined in Eq. 3.6. Let $\mu_h := \mathbb{E}[\phi(X) \mid H = h] \in \mathbb{R}^W$ be the mean parameters of the distribution for each state $h$. These parameters are computed by solving $W$ independent QPs (Eq. 3.23), yielding the matrix $\Gamma$ defined in Eq. 3.18, and then applying the formula:

$$\mu_{j,h} = \Gamma_{j,h} \times \hat{\mathbb{E}}[\phi_j(X)] \,/\, \bar{\gamma}_h, \quad (3.25)$$

with $\bar{\gamma}_h = p(H = h)$ estimated as in Eq. 3.24.

3.4.2 Estimation of Canonical Parameters

To compute a mapping from mean parameters $\mu_h$ to canonical parameters $\theta_h$, we use the well-known Fenchel-Legendre duality between the entropy and the log-partition function (Wainwright et al., 2008). Namely, we need to solve the following convex optimization problem:

$$\hat{\theta}_h = \arg\max_{\theta_h}\ \theta_h^\top \mu_h - \log Z(\theta_h) - \epsilon \|\theta_h\|_2, \quad (3.26)$$

where $\epsilon$ is a regularization constant.⁶ (⁶ As shown by Zhu et al. (1999) and Altun et al. (2006), this regularization is equivalent, in the dual, to a “soft” constraint $\|\mathbb{E}_{\theta_h}[\phi(X) \mid H = h] - \mu_h\|_2 \leq \epsilon$, as opposed to a strict equality.) In practice, this regularization is important, since it prevents $\theta_h$ from growing unbounded whenever $\mu_h$ falls outside the marginal polytope of possible mean parameters. We solve Eq. 3.26 with the limited-memory BFGS algorithm (Liu et al., 1989). Algorithm 6 summarizes the procedure.

Algorithm 6: Get Canonical Parameters
Input: mean parameters $\mu_h$
1: Compute the sufficient statistics matrix for all words, $S \in \mathbb{R}^{W \times V}$
2: while $\|\nabla f(\theta_h)\|_2 > \text{tol}$ do
3:   Compute the log-partition function: $Z_h = \log \sum_{x \in \mathcal{X}} \exp(\theta_h^\top \phi(x))$
4:   Objective function:
5:     $f(\theta_h) = \theta_h^\top \mu_h - Z_h - \epsilon \|\theta_h\|_2$
6:   Compute:
7:     $p_{\theta_h}(x \mid h) = \exp(\theta_h^\top \phi(x) - Z_h)$
8:     $\mathbb{E}_{\theta_h}[\phi(x) \mid h] = \sum_{x \in \mathcal{X}} p_{\theta_h}(x)\, \phi(x)$
9:   Gradient:
10:    $\nabla f(\theta_h) \leftarrow \mu_h - \mathbb{E}_{\theta_h}[\phi(x) \mid h] - \epsilon\, \theta_h / \|\theta_h\|_2$
11:   Update $\theta_h$ using L-BFGS.
12: Return: $\theta_h^*$.
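A compact sketch of the mean-to-canonical mapping using SciPy's L-BFGS. For smoothness, it substitutes a squared $\ell_2$ penalty for the $\epsilon\|\theta_h\|_2$ term of Eq. 3.26, so it approximates Algorithm 6 rather than transcribing it:

import numpy as np
from scipy.optimize import minimize

def canonical_params(mu_h, Phi, eps=1e-3):
    """Recover canonical parameters theta_h from mean parameters mu_h.

    mu_h: W-dim mean parameter vector E[phi(X) | H = h]
    Phi:  V x W matrix of word feature vectors (the sufficient statistics)
    """
    def neg_objective(theta):
        scores = Phi @ theta
        logZ = np.logaddexp.reduce(scores)                    # log-partition function
        p = np.exp(scores - logZ)                             # p_theta(x | h)
        f = theta @ mu_h - logZ - eps * np.dot(theta, theta)  # squared-l2 surrogate penalty
        grad = mu_h - Phi.T @ p - 2 * eps * theta
        return -f, -grad                                      # minimize the negative

    W = Phi.shape[1]
    res = minimize(neg_objective, np.zeros(W), jac=True, method="L-BFGS-B")
    return res.x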

3.5 Method Improvements

In this section we detail three improvements to our moment-based method that had a practical impact.

Supervised Regularization. We add a supervised penalty term to Eq. 3.23 that keeps the label posteriors $\gamma_j$ close to the label posteriors estimated on the labeled set, $\gamma_j'$, for every feature $j \in [W]$. The regularized least-squares problem becomes:

$$\min_{\gamma_j}\ (1 - \lambda) \|\hat{q}_j - \gamma_j^\top \hat{R}\|^2 + \lambda \|\gamma_j - \gamma_j'\|^2 \quad \text{s.t.} \quad \mathbf{1}^\top \gamma_j = 1. \quad (3.27)$$

CCA Projections. A one-hot feature representation of words and contexts has the disadvantage that it grows with the vocabulary size, making the moment matrix $Q$ too sparse; the number of contextual features and words can grow rapidly on large text corpora. Similarly to Cohen et al. (2014) and Dhillon et al. (2015b), we use canonical correlation analysis (CCA) to reduce the dimension of these vectors. We use CCA to form low-dimensional projection matrices for features of words, $P_W \in \mathbb{R}^{W \times D}$, and features of contexts, $P_C \in \mathbb{R}^{C \times D}$, with $D \ll \min\{W, C\}$, and replace the original feature vectors with their projections.

Selecting Anchors. Algorithms for identifying $\mathcal{A}(h)$ from unlabeled data alone were proposed by Arora et al. (2013) and Zhou et al. (2014). We propose instead to recover anchors from a small annotated corpus. We collect counts of each word-label pair and select up to 500 anchors with high conditional probability of the anchoring state, $\hat{p}(h \mid w)$. We tuned the probability threshold used to select the anchors on the validation set, in steps of 0.1 over the unit interval, making sure that every tag has at least one anchor. We also imposed a frequency threshold, constraining anchors to occur more than 500 times in the unlabeled corpus and at least four times in the labeled corpus. Note that past work used a single anchor word per state (i.e., $|\mathcal{A}(h)| = 1$). The anchor assumption is quite strict and is usually not satisfied in its strictest sense. We found that much better results are obtained when $|\mathcal{A}(h)| \gg 1$, as choosing more anchors increases the number of samples used to estimate the context-label moment matrix $\hat{R}$, reducing noise. However, increasing the number of anchors makes it harder to find pure anchors with $p(h \mid \mathcal{A}(h)) = 1$. In the following experimental sections, we observe that this trade-off helps empirically.
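The anchor selection heuristic described above amounts to thresholded counting. A sketch, with the thresholds from the text as default arguments and all names hypothetical:

from collections import Counter, defaultdict

def select_anchors(labeled, unlabeled_counts, prob_thresh=0.9,
                   min_unlab=500, min_lab=4, max_anchors=500):
    """Select anchor words per label from a small labeled corpus (sketch of §3.5).

    labeled: iterable of (word, tag) pairs from the labeled corpus
    unlabeled_counts: Counter of word frequencies in the raw corpus
    """
    word_tag = defaultdict(Counter)
    for w, t in labeled:
        word_tag[w][t] += 1
    anchors = defaultdict(list)
    for w, tags in word_tag.items():
        total = sum(tags.values())
        tag, cnt = tags.most_common(1)[0]
        # Keep words that are (nearly) unambiguous and frequent enough.
        if (cnt / total >= prob_thresh and total >= min_lab
                and unlabeled_counts[w] >= min_unlab):
            anchors[tag].append(w)
    # Keep the most frequent candidates, up to max_anchors per tag.
    return {t: sorted(ws, key=lambda w: -unlabeled_counts[w])[:max_anchors]
            for t, ws in anchors.items()}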

3.6 Experiments

We evaluated our method on two tasks: POS tagging of Twitter text (in English), and POS tagging for a low-resource language (Malagasy). For all the experiments, we used the universal POS tagset (Petrov et al., 2011), which consists of K = 12 tags. We compared our method against supervised baselines (HMM and FHMM), which use the labeled data only, and two semi-supervised baselines that exploit the unlabeled data: self-training and EM. For the Twitter experiments, we also evaluated a stacked architecture in which we derived features from our model's predictions to improve a state-of-the-art POS tagger (MEMM).⁷

⁷ http://www.ark.cs.cmu.edu/TweetNLP/

3.6.1 Twitter POS Tagging

For the Twitter experiment, we used the Oct27 dataset of Gimpel et al. (2011), with the provided partitions (1,000 tweets for training and 328 for validation), and tested on the Daily547 dataset (547 tweets). Anchor words were selected from the training partition as described in § 3.5. We used 2.7M unlabeled tweets (O'Connor et al., 2010) to train the semi-supervised methods, filtering the English tweets as in Lui et al. (2012), tokenizing them as in Owoputi et al. (2013), and normalizing at-mentions, URLs, and emoticons. We used as word features φ(X) the word itself, as well as binary features for capitalization, titles, and digits (Berg-Kirkpatrick et al., 2010), the word shape, and the Unicode class of each character. Similarly to Owoputi et al. (2013), we also used suffixes and prefixes (up to length 3), and Twitter-specific features: whether the word starts with @, #,

or http://. As contextual features ψ(Z), we derive analogous features for the preceding and following words, before reducing dimensionality with CCA. We collect feature expectations for words and contexts that occur more than 20 times in the unlabeled corpus. We tuned hyperparameters on the development set: the supervised interpolation coefficient in Eq. 3.27, λ ∈ {0, 0.1, ..., 1.0}, and, for all systems, the regularization coefficient ε ∈ {0.0001, 0.001, 0.01, 0.1, 1, 10}.⁸ The former controls how much we rely on the supervised vs. unsupervised estimates. For λ = 1.0 we used supervised estimates only for words that occur in the labeled corpus; all the remaining words rely solely on unsupervised estimates.

⁸ Underlines indicate selected values.

[Figure 3.3: POS tagging accuracy on the Twitter data versus the number of labeled training sequences, comparing the anchor FHMM variants (λ=0, λ=1, and +CL with λ=0) with the supervised HMM and FHMM baselines.]

Varying supervision. Figure 3.3 compares the learning curves of our anchor-word method for the FHMM with the supervised baselines. We show the performance of the anchor method without interpolation (λ = 0) and with the supervised interpolation coefficient (λ = 1). When the amount of supervision is small, our method with and without interpolation outperforms all the supervised baselines. This improvement is gradually attenuated as more labeled sequences are used, with the supervised FHMM catching up when the full labeled dataset is used. The best model (λ = 1.0) relies on supervised estimates for words that occur in the labeled corpus, and on anchor estimates for words that occur only in the unlabeled corpus. The unregularized model (λ = 0.0) relies solely on unsupervised estimates given the set of anchors.

Semi-supervised comparison. Next, we compare our method to two other semi-supervised baselines, using both HMMs and FHMMs: EM and self-training. EM requires decoding and counting in multiple passes over the full unlabeled corpus.

We initialized the parameters with the supervised estimates, and selected the iteration with the best accuracy on the development set. The FHMM with EM did not perform better than the supervised baseline, so we report the initial value as the best accuracy under this model. The self-training baseline uses the supervised system to tag the unlabeled data, and then retrains on all the data. Results are shown in Table 3.1.

Table 3.1: Tagging accuracies on Twitter: the supervised and semi-supervised baselines, and our moment-based method, trained with 150 labeled sequences and with the full labeled corpus (1,000 sequences).

Models / #sequences      HMM 150   HMM 1000   FHMM 150   FHMM 1000
supervised baseline        71.7      81.1       81.8       89.1
EM                         77.2      83.1       81.8       89.1
self-training              78.2      86.1       83.4       89.4
anchors, λ = 0.0           83.0      85.5       84.1       86.7
anchors, λ = 1.0           84.3      88.0       85.3       89.1

According to a word-level paired Kolmogorov-Smirnov test, for the FHMM with 1,000 tweets the self-training method outperforms the other methods with statistical significance at p < 0.01; for the FHMM with 150 tweets, the anchor-based and self-training methods outperform the other baselines at the same p-value. Our best HMM outperforms the other baselines at a significance level of p < 0.01 for both 150 and 1,000 sequences. We observe that, for small amounts of labeled data (150 tweets), our method outperforms all the supervised and semi-supervised baselines, yielding accuracies 6.1 points above the best semi-supervised baseline for a simple HMM, and 1.9 points above it for the FHMM. With more labeled data (1,000 instances), our method outperforms all the baselines for the HMM, but not with the more sophisticated FHMM, where our accuracies are 0.3 points below the self-training method. These results suggest that our method is most effective when the amount of labeled data is small. We further study the effect of different learning algorithms for the transition model, estimated with composite likelihood (CL) and maximum likelihood (ML), in Table 3.2. We report results on the best emission model (FHMM) with and without supervised interpolation (λ = 0, λ = 1).

Table 3.2: Anchor learning with the transition model estimated by the two methods of § 3.3.5, ML and CL, with the best FHMM anchor emissions.

Models / #sequences    150    1000
ML (λ = 1)            85.3    89.1
CL (λ = 1)            79.3    85.5
CL (λ = 0)⁹           77.6    84.0

⁹ This approach is completely unsupervised, assuming we know the set of anchor words. Anchors can also be recovered with the unsupervised approach of Arora et al. (2013).

Table 3.3: Tagging accuracy for the MEMM POS tagger of Owoputi et al. (2013) with additional features from our model's posteriors.

Models / #sequences                  150     1000
MEMM (same+clusters)                89.57    93.36
MEMM (same+clusters+posteriors)     91.14    93.18
MEMM (all+clusters)                 91.55    94.17
MEMM (all+clusters+posteriors)      92.06    94.11

Stacking features. We also evaluated a stacked architecture in which we use our model's predictions as an additional feature to improve the state-of-the-art Twitter POS tagger of Owoputi et al. (2013). This system is based on a semi-supervised discriminative model with Brown cluster features (Brown et al., 1992). We provide results using their full set of features (all), and using the same set of features as in our anchor model (same). We compare tagging accuracy for a model with these features plus Brown clusters (+clusters) against a model that also incorporates the posteriors from the anchor method as an additional feature in the MEMM (+clusters+posteriors). The results in Table 3.3 show that using our model's posteriors is beneficial in the small labeled-data case, but not if the entire labeled dataset is used.

Runtime comparison. The training time is 3.8 hours for the anchor FHMM (4 hours with composite likelihood transitions), 10.3 hours for the self-training HMM, 14.9 hours for the EM HMM, and 42 hours for the Twitter MEMM (all+clusters). The anchor method is thus much more efficient than all the baselines, because it requires a single pass over the corpus to collect the moment statistics, followed by the QPs, without the need to decode the unlabeled data. EM and the Brown clustering method (the latter used to extract features for the Twitter MEMM) require several passes over the data; and the self-training method involves decoding the full unlabeled corpus, which is expensive when the corpus is large. Our analysis adds to previous evidence that spectral methods are more scalable than learning algorithms that require inference (Parikh et al., 2012b; Cohen et al., 2013c).

3.6.2 Malagasy POS Tagging

In this section, we provide results for a low-resource language, Malagasy, for which only a small amount of annotation exists. For the Malagasy experiment, we used the small labeled dataset from Garrette et al. (2013), which consists of 176 sentences and 4,230 tokens. We also make use of their tag dictionaries, with 2,773 types and 23 tags, and their unlabeled data (43.6K sequences, 777K tokens). We converted all the original POS tags to universal tags using the mapping proposed in Garrette et al. (2013). Table 3.4 compares our method with semi-supervised EM and self-training, for the FHMM.

Note that the accuracies are not directly comparable to Garrette et al. (2013), who use a different tag set. However, our supervised baseline trained on those tags is already superior to the best semi-supervised system of Garrette et al. (2013): we obtain 86.9% against their reported 81.2% using their tagset. We tested two supervision settings: token-only and type+token annotations, analogous to Garrette et al. (2013). The anchor method outperformed the baselines when both type and token annotations were used to build the set of anchor words. In this task our method took approximately 5 minutes to train.

Table 3.4: Tagging accuracies for the Malagasy dataset.

Models                                  Accuracies
supervised FHMM                           90.5
EM FHMM                                   90.5
self-training FHMM                        88.7
anchors FHMM (token), λ = 1.0             89.4
anchors FHMM (type+token), λ = 1.0        90.9

3.7 Conclusions

In this chapter, we introduced an efficient semi-supervised sequence labeling method using a generative log-linear model. We use contextual information from a set of anchor observations to disambiguate state, and build a weakly supervised method from this set. The proposed method outperforms other supervised and semi-supervised methods under small supervision in POS tagging for Malagasy, a scarcely annotated language, and for Twitter. The anchor method is most competitive for learning from large amounts of unlabeled data under weak supervision, while training an order of magnitude faster than any of the baselines.

4 Planning with Kernel Methods

In this chapter, we combine optimization-based planning with kernel methods. We introduce a functional gradient descent trajectory optimization algorithm for robot motion planning in Reproducing Kernel Hilbert Spaces (RKHSs). This work generalizes functional gradient trajectory optimization by formulating it as minimization of a cost functional in an RKHS. This generalization lets us represent trajectories as linear combinations of kernel functions, without any need for waypoints. As a result, we are able to take larger steps and achieve a locally optimal trajectory in just a few iterations. In Section 4.2, we define trajectories in RKHSs; Section 4.3 describes the optimization algorithm; Section 4.4 demonstrates the benefits of optimizing under the RKHS norm as an inherently efficient space; Section 4.6 describes different forms of representing the obstacle avoidance functional; and Section 4.7 provides an empirical analysis of the effectiveness of the planner for different kernels, including Gaussian RBFs, Laplacian RBFs, and B-splines, as compared to the standard discretized waypoint representation.

4.1 Motivation

Motion planning is an important component of robotics: it ensures that robots are able to safely move from a start to a goal configuration without colliding with obstacles. Trajectory optimizers for motion planning focus on finding feasible configuration-space trajectories that are also efficient, e.g., approximately locally optimal for some cost function. Many trajectory optimizers have demonstrated great success in a number of high-dimensional real-world problems (Quinlan et al., 1993; Schulman et al., 2013; Todorov et al., 2005; Berg et al., 2011). Often, they work by defining a cost functional over an infinite-dimensional Hilbert space of trajectories, then taking steps down the functional gradient of cost to search for smooth, collision-free trajectories (Zucker et al., 2013; Ratliff et al., 2009).

In this work, we exploit the same functional gradient approach, but with a novel approach to trajectory representation. While previous algorithms are derived for trajectories in Hilbert spaces in theory, in practice they commit to a finite parametrization of trajectories in order to instantiate a gradient update (Zucker et al., 2013; Park et al., 2012; Kalakrishnan et al., 2011), typically a large but finite list of discretized waypoints. The number of waypoints is a parameter that trades off between computational complexity and trajectory expressiveness. Our work frees the optimizer from a discrete parametrization, enabling it to perform gradient descent on a much more general trajectory parametrization: reproducing kernel Hilbert spaces (RKHSs) (Scholkopf et al., 2001; Kimeldorf et al., 1971; Aronszajn, 1950), of which waypoint parametrizations are merely one instance. RKHSs impose just enough structure on generic Hilbert spaces to enable a concrete and implementable gradient update rule, while leaving the choice of parametrization flexible: different kernels lead to different geometries (§ 4.2). Our contribution is two-fold. Our theoretical contribution is the formulation of functional gradient descent motion planning in RKHSs as the minimization of a cost functional regularized by the RKHS norm (§ 4.3). Regularizing by the RKHS norm is a common way to ensure smoothness in function approximation (Hofmann et al., 2008), and we apply the same idea to trajectory parametrization. By choosing the RKHS appropriately, the trajectory norm can quantify different forms of smoothness or efficiency, such as preferring small values for any n-th order derivative (Yuan et al., 2010). So, RKHS norm regularization can be tuned to prefer trajectories that are smooth with, for example, low velocity, acceleration, or jerk (§ 4.4) (Rawlik et al., 2013). Our practical contribution is an algorithm for very efficient motion planning in an inherently smooth trajectory space with low-dimensional parametrizations. Unlike discretized parametrizations, which require many waypoints to produce smooth trajectories, our algorithm can represent and search for smooth trajectories with only a few point evaluations. The inherent smoothness of our trajectory space also increases efficiency: our parametrization allows the optimizer to take large steps at every iteration without violating trajectory smoothness, therefore converging to a collision-free and high-quality trajectory faster than competing approaches. Our experiments demonstrate the effectiveness of planning in RKHSs using a synthetic 2D environment with a 3-DOF planar arm, and using more complex scenarios with a 7-DOF robotic arm. We show how different choices of kernels yield different preferences over trajectories. We further introduce reproducing kernels that represent interactions among joints. Sections 4.6 and 4.7 illustrate these advantages of

RKHSs, and compare different choices of kernels.

4.2 Trajectories in Reproducing Kernel Hilbert Spaces

A trajectory is a function ξ ∈ H, where H ⊆ {ξ : [0, 1] → C} defines a space of functions that map time t ∈ [0, 1] to robot configurations ξ(t) ∈ C ≡ R^D. Note that ξ ∈ H is a trajectory, a vector-valued function in the RKHS, while ξ(t) ∈ C is a trajectory evaluation at a single time point, corresponding to a robot configuration. We can treat a set of trajectories as a Hilbert space by defining vector-space operations such as addition and scalar multiplication of trajectories (Kreyszig, 1978). In this work, we restrict ourselves to trajectories in reproducing kernel Hilbert spaces. We can upgrade our Hilbert space H to an RKHS by assuming additional structure: for any y ∈ C and t ∈ [0, 1], the functional ξ ↦ y^⊤ξ(t) must be continuous (Scholkopf et al., 2001; Wahba, 1999; Ratliff et al., 2007). Note that, since the configuration space is typically multidimensional (D > 1), our trajectories form an RKHS of vector-valued functions (Micchelli et al., 2005), defined by the above property. Vector-valued RKHSs offer a direct approach to modeling different views, or dimensions, of the same input data. In contrast, a real-valued RKHS could describe the same input information using separate, independent real-valued functions. However, it is often useful to consider the interactions of these different views/dimensions when learning the mapping from input to output data; one such case is the problem studied in this chapter, robot pose modeling. Depending on whether the chosen RKHS describes real- or vector-valued functions, we select a scalar- or matrix-valued kernel, depending on whether we are interested in modeling a single dimension or the dependencies among dimensions, respectively. For instance, in a robot setting we are interested in reproducing kernels with continuity properties, such as C^∞ kernels whose velocities, accelerations, and higher-order derivatives are also continuous (e.g., the Gaussian RBF kernel). The reproducing kernel associated with a vector-valued RKHS becomes a matrix-valued kernel K : [0, 1] × [0, 1] → R^{D×D}. Eq. 4.1 shows the kernel matrix for two time instances:

$$K(t, t') = \begin{bmatrix} k_{1,1}(t,t') & k_{1,2}(t,t') & \dots & k_{1,D}(t,t') \\ k_{2,1}(t,t') & k_{2,2}(t,t') & \dots & k_{2,D}(t,t') \\ \vdots & & \ddots & \vdots \\ k_{D,1}(t,t') & k_{D,2}(t,t') & \dots & k_{D,D}(t,t') \end{bmatrix} \qquad (4.1)$$

This matrix has a very intuitive physical interpretation: each element k_{d,d'}(t, t') of Eq. 4.1 tells us how joint [ξ(t)]_d at time t affects the motion of joint [ξ(t')]_{d'} at time t', i.e., the degree of correlation or similarity between

the two (joint, time) pairs. In practice, the off-diagonal terms of Eq. 4.1 are not zero, hence perturbations of a given joint d propagate through time, as well as through the rest of the joints. The norm and inner product defined in a vector-valued RKHS can be written in terms of the kernel matrix, via the reproducing property: trajectory evaluation can be represented as an inner product of the vector-valued functions in the RKHS, as described in Eq. 4.2 below. For any configuration y ∈ C and time t ∈ [0, 1], we get the inner product of y with the trajectory in the vector-valued RKHS, evaluated at time t:

$$y^\top \xi(t) = \langle \xi, K(t, \cdot)\,y \rangle_{\mathcal{H}}, \quad \forall y \in \mathcal{C} \qquad (4.2)$$

In our planning algorithm we represent a trajectory in the RKHS in terms of a finite support {t_i}_{i=1}^N ∈ T. This set grows adaptively as we pick more points to represent the final trajectory. At each step our trajectory is a linear combination of functions in the RKHS, each indexed by a time point in T (see Figure 4.1):

$$y^\top \xi(t) = \sum_{t_i \in \mathcal{T}} y^\top K(t, t_i)\, a_i \qquad (4.3)$$

for t, t_i ∈ [0, 1] and a_i ∈ C. If we consider the configuration vector y ≡ e_d to be the indicator of joint d, then we can capture its evolution over time as [ξ(t)]_d = Σ_i e_d^⊤ K(t, t_i) a_i, taking into account the effect of all other joints. The inner product in H of two functions y^⊤ξ¹(t) = Σ_i y^⊤K(t, t_i) a_i and y^⊤ξ²(t) = Σ_j y^⊤K(t, t_j) b_j, for y, a_i, b_j ∈ C, is defined as:

$$\langle \xi^1, \xi^2 \rangle_{\mathcal{H}} = \sum_{i,j} a_i^\top K(t_i, t_j)\, b_j \qquad (4.4)$$

$$\|\xi\|_{\mathcal{H}}^2 = \langle \xi, \xi \rangle_{\mathcal{H}} = \sum_{i,j} a_i^\top K(t_i, t_j)\, a_j \qquad (4.5)$$

[Figure 4.1: Trajectory in an RKHS as a linear combination of kernel functions. At each iteration, the optimizer takes the current trajectory (black) and identifies the points of maximum obstacle cost t_i (orange). It then updates the trajectory by a point evaluation function centered around t_i. Grey regions depict isocontours of the obstacle cost field (darker means closer to obstacles, higher cost). Black arrows show the reduce operation with 5 max-cost points.]

For example, in the Gaussian RBF RKHS (with kernel k_{d,d'}(t, t') = exp(−‖t − t'‖²/2σ²) when d = d', and 0 otherwise), a trajectory is a weighted sum of radial basis functions:

$$\xi(t) = \sum_{d,i} a_{i,d}\, \exp\Big(-\frac{\|t - t_i\|^2}{2\sigma^2}\Big)\, e_d, \quad a_{i,d} \in \mathbb{R} \qquad (4.6)$$
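For illustration, a minimal sketch of evaluating such a trajectory and its RKHS norm (Eqs. 4.3-4.6), assuming the separable kernel of Eq. 4.7 below, i.e., a scalar Gaussian RBF times a constant D × D coupling matrix; all names are illustrative.

```python
import numpy as np

def rbf(t, ti, sigma=0.9):
    return np.exp(-(t - ti) ** 2 / (2 * sigma ** 2))

def eval_trajectory(t, support, coeffs, M, sigma=0.9):
    """support: (N,) time points t_i; coeffs: (N, D) coefficients a_i;
    M: (D, D) coupling matrix (identity for independent joints).
    Returns xi(t) in R^D, i.e. sum_i K(t, t_i) a_i (Eq. 4.3)."""
    weights = rbf(t, support, sigma)          # k(t, t_i) for all i
    return M @ (coeffs.T @ weights)

def rkhs_norm_sq(support, coeffs, M, sigma=0.9):
    """Squared RKHS norm of the trajectory (Eq. 4.5)."""
    G = rbf(support[:, None], support[None, :], sigma)   # (N, N) Gram matrix
    # sum_{i,j} k(t_i, t_j) * a_i^T M a_j
    return np.einsum("id,ij,de,je->", coeffs, G, M, coeffs)
```

With `M = np.eye(D)` this reduces to independent joints; with `M = J.T @ J` it reproduces the joint-coupled kernel introduced in Eq. 4.7.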

The coefficients a_{i,d} assess how important a particular joint d is to the overall trajectory at time t_i. They can be interpreted as weights of local perturbations to the motions of different joints, centered at different times. Interactions among joints, as described in Eq. 4.1, can be represented, for instance, using a kernel matrix of the form:

$$K(t, t') = \exp\Big(-\frac{\|t - t'\|^2}{2\sigma^2}\Big)\, J^\top J \;\in\; \mathbb{R}^{D \times D} \qquad (4.7)$$

where J ∈ R^{3×D} represents the workspace Jacobian matrix at a fixed configuration. This strategy changes the RKHS metric in configuration space according to the robot Jacobian in the workspace; the resulting norm can be interpreted as an approximation of the velocity of the robot in the workspace (Ratliff et al., 2015). The trajectory norm measures the size of the perturbations, and the correlations among them, quantifying how complex the trajectory is in the RKHS (see Section 4.4). Different norms can be considered for the RKHS; this can leverage more problem-specific information, which may reduce the number of iterations required to find a low-cost trajectory.

4.3 Motion Planning in an RKHS

In this section, we describe how trajectory optimization can be achieved by functional gradient descent in an RKHS of trajectories.

4.3.1 Cost Functional

We introduce a cost functional U : H → R that maps each trajectory to a scalar cost. This functional quantifies the quality of a given trajectory (a function in the RKHS). U trades off between a regularization term that measures the efficiency of the trajectory, and an obstacle term that measures its proximity to obstacles:

$$\mathcal{U}[\xi] = \mathcal{U}_{obs}[\xi] + \frac{\beta}{2}\,\|\xi\|_{\mathcal{H}}^2 \qquad (4.8)$$

[Figure 4.2: 3-link arm robot in a 2D workspace. Robot configuration ξ(t) = (θ₁(t), θ₂(t), θ₃(t)) in red, robot body points B in orange. Grey regions depict the obstacle cost field in workspace (darker means closer to obstacles, higher cost).]

As described in Section 4.4, we choose our RKHS so that the regularization term encodes our desired notion of smoothness or trajectory efficiency, e.g., minimum length, velocity, acceleration, or jerk. The obstacle cost functional is defined on trajectories in configuration space C, but obstacles are defined in the robot's workspace W ≡ R³. So, we connect configuration space to workspace via a forward kinematics map x: if B is the set of body points of the robot, then x : C × B → W tells us the workspace coordinates of a given body point when the robot is in a given configuration (Figure 4.2). We can then decompose the obstacle cost functional as:

$$\mathcal{U}_{obs}[\xi] \equiv \operatorname*{reduce}_{t,u}\; c\big(x(\xi(t), u)\big) \qquad (4.9)$$

where reduce is an operator that aggregates costs over the entire trajectory and robot body, e.g., a maximum or an integral (see Section 4.6). We assume that the reduce operator takes (at least approximately) the form of a sum over some finite set of (time, body point) pairs T(ξ):

$$\mathcal{U}_{obs}[\xi] \approx \sum_{(t,u) \in \mathcal{T}(\xi)} c\big(x(\xi(t), u)\big) \qquad (4.10)$$

For example, the maximum operator takes this form: if (t, u) achieves the maximum, then T(ξ) is the singleton set {(t, u)} (Figure 4.1). Integral operators do not take this form, but they can be approximated to high accuracy in this form using quadrature rules (see Section 4.6.2).

4.3.2 Optimization

We can derive the functional gradient update by minimizing a local linear approximation of U in Eq. 4.8:

$$\xi^{n+1} = \arg\min_{\xi}\; \langle \xi - \xi^n, \nabla\mathcal{U}[\xi^n] \rangle_{\mathcal{H}} + \frac{\lambda}{2}\,\|\xi - \xi^n\|_{\mathcal{H}}^2 \qquad (4.11)$$

The quadratic term is based on the RKHS norm, meaning that we prefer “smooth” updates, analogous to Zucker et al. (2013). This minimization admits a solution in closed form:

$$\xi^{n+1}(\cdot) = \Big(1 - \frac{\beta}{\lambda}\Big)\,\xi^n(\cdot) - \frac{1}{\lambda}\,\nabla\mathcal{U}_{obs}[\xi^n](\cdot) \qquad (4.12)$$

[Figure 4.3: Iterative trajectory update with gradient descent. Each term is a sum of kernel functions in the RKHS.]

Since we have assumed that the cost functional U_obs[ξ] depends only on a finite set of points T(ξ) (Eq. 4.10), it is straightforward to show that the functional gradient update has a finite representation, so that the overall trajectory, which is a sum of such updates, also has a finite representation (Figure 4.3). In particular, assume the workspace cost field c and the forward kinematics function x are differentiable; then we can obtain the cost functional gradient by the chain rule (Ratliff et al., 2007; Scholkopf et al., 2001):

$$\nabla\mathcal{U}_{obs}(\cdot) = \sum_{(t,u)\in\mathcal{T}} K(\cdot, t)\, J^\top(t, u)\, \nabla c\big(x(\xi(t), u)\big) \qquad (4.13)$$

where J(t, u) = ∂x(ξ(t), u)/∂ξ(t) ∈ R^{3×D} is the workspace Jacobian matrix at time t for body point u, and the kernel function K(·, t) is the gradient of ξ(t) with respect to ξ. The kernel matrix is defined in Eq. 4.1. This solution is a generic form of functional gradient optimization with a directly instantiable obstacle gradient that does not depend on a predetermined set of waypoints, offering a more expressive representation with fewer parameters. We derive a constrained optimization update rule by solving the KKT conditions for a vector of Lagrange multipliers (see Section 4.3.3). The full method is summarized in Algorithm 7.
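A hedged sketch of one unconstrained update (Eqs. 4.12 and 4.13): the trajectory is stored as (support, coefficient) pairs, old coefficients shrink by (1 − β/λ), and each max-cost point contributes a new kernel term; `cost_grad(t, u)` (standing for ∇c(x(ξ(t), u))) and `jacobian(t, u)` are assumed callables returning NumPy arrays, not part of the thesis.

```python
def gradient_step(support, coeffs, max_points, lam, beta, cost_grad, jacobian):
    """support: list of time points t_i; coeffs: list of a_i in R^D;
    max_points: list of (t, u) pairs from the reduce operation (Eq. 4.10)."""
    # Shrink existing coefficients: the (1 - beta/lam) term of Eq. 4.12.
    coeffs = [(1.0 - beta / lam) * a for a in coeffs]
    # Append one kernel term per max-cost point: the -(1/lam) gradient term.
    for t, u in max_points:
        a_new = -(1.0 / lam) * (jacobian(t, u).T @ cost_grad(t, u))
        support = support + [t]
        coeffs = coeffs + [a_new]
    return support, coeffs
```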

4.3.3 Constrained optimization

Consider equality constraints (fixed start and goal configurations) h(ξ(t)) = 0 and inequality constraints (joint limits) g(ξ(t)) ≤ 0 on the trajectory.

Algorithm 7: Trajectory optimization in RKHSs
Input: N_MAX, c, ∇c, ξ(0), ξ(1)
1: for each joint angle d ∈ D do
2:   Initialize to a straight-line trajectory ξ⁰_d(t) = ξ_d(0) + (ξ_d(1) − ξ_d(0)) t
3: while U[ξⁿ] > ε and n < N_MAX do
4:   Compute U_obs[ξⁿ] (Eq. 4.13)
5:   Find the support of time/body points T(ξ) = {(t_i, u_i)}, i = 1, ..., N (Eq. 4.10)
6:   for all (t_i, u_i) ∈ T(ξ) do
7:     Evaluate the cost gradient ∇c(ξ(t_i), u_i) and the Jacobian J(t_i, u_i)
8:   Update the trajectory:
       ξⁿ⁺¹ = (1 − β/λ) ξⁿ − (1/λ) Σ_{(t,u)∈T} K(·, t) J^⊤(t, u) ∇c(x(ξ(t), u))
9:   If constraints are present, project onto the constraint set (Eq. 4.16)
10: Return: final trajectory ξ* and costs ‖ξ*‖²_H, U_obs[ξ*]

We write them as inner products with kernel functions in the RKHS (Eq. 4.14). For any configurations y, q_o, q_p ∈ C, and for t_o ∈ {0, 1}, t_p ∈ [0, 1]:

$$h(\cdot)^\top y = \langle \xi, K(t_o, \cdot)\,y \rangle_{\mathcal{H}} - q_o^\top y = 0 \qquad (4.14)$$
$$g(\cdot)^\top y = \langle \xi, K(t_p, \cdot)\,y \rangle_{\mathcal{H}} - q_p^\top y \le 0 \qquad (4.15)$$

Let γ^o, µ^p ∈ R^D be the concatenations of all equality (γ^o) and inequality (µ^p) Lagrange multipliers. We rewrite the objective function in Eq. 4.11 to include the joint constraints:

$$\xi^{n+1}(\cdot) = \arg\min_{\xi}\; \langle \xi - \xi^n, \nabla\mathcal{U}[\xi^n] \rangle_{\mathcal{H}} + \frac{\lambda}{2}\,\|\xi - \xi^n\|_{\mathcal{H}}^2 + \gamma^{o\top} h[\xi] + \mu^{p\top} g[\xi] \qquad (4.16)$$

Solving the KKT system for the stationary point (ξ, γ^o, µ^p) of Eq. 4.16, with µ^p ≥ 0, we obtain the constrained solution in Eq. 4.17. Let dc_j ≡ J^⊤(t_j, u_j) ∇c(x(ξⁿ(t_j), u_j)) ∈ R^D. The full update rule becomes:

$$\xi^*(\cdot) = \Big(1 - \frac{\beta}{\lambda}\Big)\,\xi^n(\cdot) - \frac{1}{\lambda}\Big(\sum_{t_j\in\mathcal{T}} K(\cdot, t_j)\,dc_j + K(\cdot, t_o)\,\gamma^o + K(\cdot, t_p)\,\mu^p\Big) \qquad (4.17)$$

This constrained solution ends up augmenting the finite support set T(·) with points that violate the constraints, weighted by the respective Lagrange multipliers. Each multiplier can be interpreted as a quantification of how much the points t_o or t_p affect the overall trajectory over time and joint space.

Efficient Constraint Update

We consider the set of max-cost points in the support T, the set of equality-constraint violation points (endpoints) T^o, and the set of inequality violation points

(joint limits) T^p. We recover the Lagrange multipliers in Eq. 4.17 by solving Eq. 4.18.

$$\begin{bmatrix} K(T^o, T^o) & K(T^o, T^p) \\ K(T^o, T^p)^\top & K(T^p, T^p) \end{bmatrix} \begin{bmatrix} \gamma^o \\ \mu^p \end{bmatrix} = \begin{bmatrix} (\lambda-\beta)\,\xi^n(T^o) + dc_j^\top K(\mathcal{T}, T^o) - q_o \\ (\lambda-\beta)\,\xi^n(T^p) + dc_j^\top K(\mathcal{T}, T^p) - q_p \end{bmatrix} \equiv \begin{bmatrix} h(T^o) \\ g(T^p) \end{bmatrix} \qquad (4.18)$$

To solve the system efficiently we make use of the reproducing property K(T^o, T^o) = K(T^o, T^p) K(T^p, T^p)^{-1} K(T^o, T^p)^⊤, and of the fact that we use a separable kernel K(t_i, t_j) = k(t_i, t_j) ⊗ J^⊤J:

$$K(T^p, T^p)^{-1} = k(T^p, T^p)^{-1} \otimes (J^\top J)^{-1}$$
$$\mu^p = K(T^p, T^p)^{-1}\, g(T^p) \qquad (4.19)$$
$$\gamma^o = K(T^o, T^o)^{-1}\big(h(T^o) - K(T^o, T^p)\,\mu^p\big)$$
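A small sketch of this multiplier solve, exploiting the Kronecker structure of the separable kernel; the Gram blocks and constraint violations are assumed precomputed, and the final projection onto µ ≥ 0 is a simplification of the full KKT conditions.

```python
import numpy as np

def solve_multipliers(k_oo, k_op, k_pp, JtJ, h_o, g_p):
    """k_oo: (O, O), k_op: (O, P), k_pp: (P, P) scalar-kernel Gram blocks;
    JtJ: (D, D) joint coupling J^T J; h_o: (O*D,), g_p: (P*D,) violations."""
    # K(T^p, T^p)^-1 = k(T^p, T^p)^-1 (x) (J^T J)^-1, as in Eq. 4.19
    K_pp_inv = np.kron(np.linalg.inv(k_pp), np.linalg.inv(JtJ))
    mu = K_pp_inv @ g_p                              # inequality multipliers
    mu = np.maximum(mu, 0.0)                         # crude projection: mu >= 0
    K_op = np.kron(k_op, JtJ)
    K_oo = np.kron(k_oo, JtJ)
    gamma = np.linalg.solve(K_oo, h_o - K_op @ mu)   # equality multipliers
    return gamma, mu
```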

4.4 Trajectory Efficiency as Norm Encoding in RKHS

In different applications, it is useful to consider different notions of trajectory efficiency or smoothness. We can do so by choosing appropriate kernel functions that have the desired property, and consequently the desired induced norm/metric. In Figure 4.4 we show empirically how the kernel choice impacts the resulting trajectory: after just one update, different kernels yield different trajectories. Later we study this effect in more complex scenarios. For instance, we can choose a kernel according to the topology of the obstacle field, or we can learn a kernel from demonstrations or user input, bringing problem-specific information into the planning problem. This can help improve the efficiency of the planner. Another possibility is to tune the resolution of the kernel via its width; we could build a kernel with adaptive width according to the environment, i.e., higher sensitivity (lower width) in cluttered scenarios.

[Figure 4.4: 2D trajectory with large steps (1 iteration, 5 max-cost points in white). Trajectory profiles using different kernels, in order (top-down): Gaussian RBF, B-splines, Laplacian RBF, and waypoints.]

Additionally, it is often desirable to penalize the velocity, acceleration, jerk, or other derivatives of a trajectory instead of (or in addition to) its magnitude. To do so, we can take advantage of a derivative reproducing property: let H¹ be one of the coordinate RKHSs from our trajectory representation, with kernel k. If k has sufficiently many continuous derivatives, then for each partial derivative operator D^α there exist representers (D^α k)_t ∈ H¹ such that, for all f ∈ H¹, (D^α f)(t) = ⟨(D^α k)_t, f⟩ (Zhou, 2008, Theorem 1). (Here α is a multi-set of indices, indicating which partial derivative we are referring to.) We can, therefore, define a new RKHS with a norm that penalizes the partial derivative D^α: the kernel is k^α(t, t') = ⟨(D^α k)_t, (D^α k)_{t'}⟩. If we use this RKHS norm as the smoothness penalty for the corresponding coordinate of

our trajectories, then our optimizer will automatically seek out trajectories with low velocity, acceleration, or jerk in that coordinate. For example, consider an RBF kernel with a reproducing first-order derivative: D¹k(t, t_i) = (D¹k)_{t_i}[t] = −((t − t_i)/σ²) k(t, t_i) is the reproducing kernel for the velocity profile of a trajectory defined in the RBF kernel space k(t, t_i) = (1/√(2πσ²)) exp(−‖t − t_i‖²/2σ²). The velocity profile of a trajectory ξ(t) = Σ_i β_i k(t, t_i) can be written as D¹ξ(t) = Σ_i β_i D¹k(t, t_i). The trajectory can then be recovered by integrating D¹ξ(t) once and using the constraint ξ(0) = q_i.

$$\xi(T) = \int_0^T D^1\xi(t)\,dt = \sum_i \beta_i \int_0^T -\frac{(t - t_i)}{\sigma^2}\, k(t, t_i)\,dt = \sum_i \beta_i\,\big[k(T, t_i) - k(0, t_i)\big] + q_i \qquad (4.20)$$
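As a quick sanity check of Eq. 4.20, a tiny sketch that reconstructs the trajectory from the velocity-profile coefficients; names are illustrative.

```python
import numpy as np

def k(t, ti, sigma=0.9):
    """Normalized Gaussian RBF used in the velocity-kernel example above."""
    return np.exp(-(t - ti) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def xi_from_velocity(T, betas, centers, q0, sigma=0.9):
    """xi(T) = sum_i beta_i [k(T, t_i) - k(0, t_i)] + q0 (Eq. 4.20)."""
    return q0 + sum(b * (k(T, ti, sigma) - k(0.0, ti, sigma))
                    for b, ti in zip(betas, centers))
```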

The initial condition is satisfied automatically. The endpoint condition can be written as q_f = Σ_i β_i [k(1, t_i) − k(0, t_i)] + q_i; this imposes additional constraints on the coefficients β_i ∈ C, which we can enforce during optimization. Here we explicitly consider only the space of first derivatives, but extensions to higher-order derivatives can be derived similarly, integrating p times to obtain the trajectory profile. Constraints over higher derivatives (up to order n) can be enforced using any constraint projection method inside our gradient descent iteration. The update rule in this setting can be derived using the natural gradient in the space, where the new obstacle gradient becomes:

$$\nabla\mathcal{U}_{obs}(\cdot) = \sum_{\alpha} \sum_{(t,u)\in\mathcal{T}} D^\alpha K(\cdot, t)\, J^\top(t, u)\, \nabla c\big(x(\xi(t), u)\big) \qquad (4.21)$$

Although Gaussian RBFs are a default choice of kernel in many kernel methods, RKHSs can also easily accommodate other types of kernel functions. For example, B-splines are a popular parametrization of smooth functions (Zhang et al., 1995; Pan et al., 1995; Blake et al., 1998) that can express smooth trajectories while avoiding obstacles, even though they are finite-dimensional kernels. Regularization schemes in different RKHSs encode different forms of trajectory efficiency; here we proposed a different norm using derivative penalization. The choice of kernel should be application-driven, and any reproducing kernel can be used within the optimization framework presented here. Other RKHS norms may be defined via sums, products, tensor products of kernels, or any other closed kernel operation.

4.4.1 Finite approximation of Path Integral Cost

Trajectory optimization in RKHSs can be derived for different types of obstacle cost functionals, provided that trajectories have a finite representation. Previous work defines an obstacle cost in terms of the arc-length integral of the trajectory (Zucker et al., 2013). We approximate this path integral cost functional with a finite representation using integral approximation methods, such as quadrature (Press et al., 1992). Consider a finite set of time points t_i ∈ T to be the abscissas of an integral approximation method. We use Gauss-Legendre quadrature, and take the t_i to be roots of the Legendre polynomial P_n(t) of degree n. Let w_i be the respective weight of each cost sample:

$$\mathcal{U}_{obs}[\xi] = \int_0^1 c\big[\xi(t)\big]\, \big\|D^1\xi(t)\big\|\,dt \;\approx\; \sum_{t_i\in\mathcal{T}} \omega_i\, c\big[\xi(t_i)\big]\, \big\|D^1\xi(t_i)\big\| \qquad (4.22)$$

with the coefficients and Legendre polynomials obtained recursively from Rodrigues' formula (Press et al., 1992):

$$P_n(t) = 2^n \sum_{j=0}^{n} t^j \binom{n}{j} \binom{\tfrac{n+j-1}{2}}{n}, \qquad w_i = \frac{2}{\big(1 - t_i^2\big)\,\big[D^1 P_n(t_i)\big]^2}$$

We denote by D¹ ≡ d/dt the first-order time derivative. Using this notation, we are able to work with integral functionals, while still using a finite set of time points to represent the full trajectory.
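In practice one need not implement Rodrigues' formula by hand. Below is a short sketch using NumPy's Gauss-Legendre nodes and weights, remapped from [−1, 1] to [0, 1], with `cost` and the trajectory `xi` as assumed callables and a finite-difference stand-in for D¹ξ.

```python
import numpy as np

def path_integral_cost(cost, xi, n=20, h=1e-4):
    """Approximate int_0^1 c(xi(t)) ||D^1 xi(t)|| dt with n-point quadrature."""
    nodes, weights = np.polynomial.legendre.leggauss(n)
    t = 0.5 * (nodes + 1.0)          # remap abscissas from [-1, 1] to [0, 1]
    w = 0.5 * weights                # rescale weights for the new interval
    total = 0.0
    for ti, wi in zip(t, w):
        vel = (xi(ti + h) - xi(ti - h)) / (2 * h)   # finite-difference D^1 xi
        total += wi * cost(xi(ti)) * np.linalg.norm(vel)
    return total
```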

4.5 Kernel Metric in RKHS

The norm provides a way of quantifying how complex a trajectory is in the space associated with the RKHS kernel metric K. The kernel metric is determined by the kernel functions we choose for the RKHS, as we have seen (§ 4.4). Likewise, the set of time points T that supports the trajectory contributes to the design of the kernel metric:

$$\|\xi\|_{\mathcal{H}}^2 = \sum_{d,d'} \sum_{t_i, t_j \in \mathcal{T}} a_{d,i}\, k_{d,d'}(t_i, t_j)\, a_{d',j} = \sum_{t_i, t_j \in \mathcal{T}} a_i^\top K(t_i, t_j)\, a_j = a^\top K(\mathcal{T}, \mathcal{T})\, a, \quad a_i, a_j \in \mathbb{R}^D,\; a \in \mathbb{R}^{DN} \qquad (4.23)$$

Here a is the concatenation of all coefficients a_i over T, with |T| = N. K(T, T) ∈ R^{DN×DN} is the Gram matrix over all time points in the support of ξ and all joint angles of the robot. This matrix expresses the

degree of correlation or similarity among different joints throughout the time points in T. It can be interpreted, alternatively, as a tensor metric in a Riemannian manifold (Amari, 1998; Ratliff et al., 2015). Its inverse is the key element that bridges the gradient ∇U of the functional cost (the gradient in the RKHS, Eq. 4.13) and its conventional (Euclidean) gradient, which makes the process covariant (invariant to reparametrization):

$$\nabla\mathcal{U} = K^{-1}(\mathcal{T}, \mathcal{T})\,\nabla_E\,\mathcal{U} \qquad (4.24)$$

The minimizer of the full functional cost U has a closed-form solution in Eq. 4.12, where the gradient ∇U is the natural gradient in the RKHS. This can be seen as a warped version of the obstacle cost gradient according to the RKHS metric.

4.5.1 Waypoint Parametrization as an Instance of RKHS

Consider a general Hilbert space of trajectories ξ ∈ Ξ (not necessarily an RKHS) equipped with an inner product ⟨ξ₁, ξ₂⟩_Ξ = ξ₁^⊤ A ξ₂. In the waypoint representation (Zucker et al., 2013), A is typically a Hessian matrix over points in the trajectory, which makes the norm penalize unsmooth and inefficient trajectories, in the sense of high-acceleration trajectories. Minimization under this norm ‖ξ‖_A = √(ξ^⊤ A ξ) performs a line search along the negative gradient direction, where A dictates the shape of the manifold over trajectories.

[Figure 4.5: Trajectory as a linear combination of kernel functions for 1 DoF: (a) A^{-1}δ functions (waypoints); (b) four Gaussian RBF functions.]

This work generalizes the waypoint parametrization: we can represent waypoints by expressing the trajectory in terms of Dirac delta basis functions, ⟨ξ, δ(t, ·)⟩ = ξ(t), with an additional smoothness metric A. Without A, each individual point is allowed to change without affecting points in its vicinity. Previous work overcame this caveat by introducing a new metric that propagates changes of a single point in the trajectory to all the other points. Kernel evaluations in this case become k(t_i, ·) = A^{-1}δ(t_i, ·), where ξ(t) = Σ_i a_i A^{-1}δ(t_i, ·), and the inner product of two functions is defined as ⟨ξ¹, ξ²⟩_A = Σ_{i,j} a_i b_j A^{-1}δ(t_i, t_j). Here, δ(t_i, ·) represents the finite-dimensional delta function, which is one at point t_i and zero at all other points. A trajectory in the waypoint representation becomes a linear combination of the columns of A^{-1}, and the columns of A^{-1} dictate how the corresponding point affects the full trajectory (Figure 4.5a). For an arbitrary kernel representation, the behaviour of these points over the full trajectory is governed by the kernel functions associated with the space: with radial basis functions, the trajectory is represented as Gaussians centered at a small set of chosen time points (far fewer in practice), instead of the full set of trajectory waypoints (Figure 4.5b). In this sense, we have a more

compact trajectory representation using RKHSs.

4.6 Cost Functional Analysis

Next, we analyze how the form of the cost functional (the different forms of the reduce operation in Section 4.3.1) affects obstacle avoidance performance and the resulting trajectory (§ 4.6). We adopt a maximum-cost version (§ 4.6.1) and an approximate integral-cost version of the obstacle cost functional (§ 4.6.2). Other variants could be considered, provided the trajectory support remains finite, but we leave this as future work. Additionally, we compare the two forms against a more commonly used cost functional, the path integral cost (Ratliff et al., 2009), and show that our formulations perform comparably while being faster to compute (§ 4.6.3). Based on these experiments, in the remaining sections we consider only the max-cost formulation, which we believe represents a good tradeoff between speed and performance.

4.6.1 Max-Cost Formulation

The maximum obstacle cost penalizes body points that pass too close to obstacles, i.e., through high-cost regions in the workspace (regions inside or near obstacles). This maximum-cost version of the reduce operation in Eq. 4.9 can be described as picking (by sampling) the time and body points deepest inside or closest to obstacles (see Figure 4.1). The sampling strategy for picking the time points that represent the trajectory cost can be chosen arbitrarily, and further improved for time efficiency. We consider a simple version: we sample points uniformly along sections of the trajectory, and choose N maximum-violating points, one per section. This max-cost strategy allows us to represent trajectories in terms of a few points, rather than a set of finely discretized waypoints. It is a simplified version of the obstacle cost functional that yields a more compact representation (Ratliff et al., 2009; Park et al., 2012; Kalakrishnan et al., 2011).
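A sketch of this reduce operation: sample times uniformly within each trajectory section and keep the worst (time, body point) pair per section; `cost` and `forward_kinematics` are assumed helpers, not defined in the thesis text.

```python
import numpy as np

def max_cost_support(xi, body_points, cost, forward_kinematics,
                     n_sections=5, samples_per_section=20):
    """Return a list of N = n_sections (t, u) pairs of maximum obstacle cost."""
    support = []
    edges = np.linspace(0.0, 1.0, n_sections + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        candidates = [(cost(forward_kinematics(xi(t), u)), t, u)
                      for t in np.linspace(lo, hi, samples_per_section)
                      for u in body_points]
        c, t, u = max(candidates, key=lambda cand: cand[0])  # worst violator
        support.append((t, u))
    return support
```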

4.6.2 Integral Cost Formulation

Instead of scoring a trajectory by the maximum of obstacle cost over time and body points, it is common to integrate the cost over the entire trajectory and body, with the trajectory integral weighted by arc length to avoid velocity dependence (Zucker et al., 2013). While this path integral depends on all time and body points, we can approximate it to high accuracy from a finite number of point evaluations using numerical quadrature (Press et al., 1992). T(ξ) then becomes the set of abscissas of the quadrature method, which can be adaptively chosen

on each time step (e.g., to bracket the top few local optima of obstacle cost); see Section 4.4.1. In our experiments, we have observed good results with Gauss-Legendre quadrature.

4.6.3 Integral vs. Max-Cost Formulation

We show that the max-cost formulation does not hinder the optimization: it leads to practically equivalent results as an integral over time and body points (Zucker et al., 2013). To do so, we vary the cost functional formulation, and measure the resulting trajectories' cost in terms of the integral formulation. Figure 4.6a shows the comparison: the integral cost decreased by only 5% when optimizing for the max. Additionally, we tested the max-cost formulation against the approximate integral cost using a Gauss-Legendre quadrature method.

We performed tests over 100 randomly sampled scenarios and measured the final obstacle cost after 10 iterations. We used 20 points to represent the trajectory in both cases. Figure 4.6b shows that the approximate integral cost formulation is only 8% above the max approach.

4.7 Experimental Results

In what follows, we compare the performance of RKHS trajectory optimization vs. a discretized version (CHOMP) on a set of motion planning problems in a 2D world with a 3-DOF planar arm, as in Figure 4.7, and study how different kernels with different norms affect the performance of the algorithm (§ 4.7.1). We then present a series of experiments that illustrate why RKHSs improve optimization (§ 4.7.2).

4.7.1 RKHS with various Kernels vs. Waypoints 0!

[Figure 4.6: (a) Integral cost after 5 large steps, comparing Gaussian RBF kernels against the integral formulation (with waypoints). (b) Gaussian RBF kernel integral cost using our max formulation vs. the approximate quadrature cost (20 points, 10 iterations).]

For our main experiment, we systematically evaluate different parametrizations across a series of planning problems. We manipulate the parametrization (waypoints vs. different kernels) as well as the number of iterations (which we use as a covariate in the analysis). We select the stepsize that performs best over 10 iterations for each method, and keep this parameter fixed and constant between methods (σ = 0.9). To control for the cost functional as a confound, we use the max formulation for both parametrizations. We use iterations as our covariate because they are a natural unit in optimization, and because the amount of time per iteration is similar: the computational bottleneck in all methods is computing the maximum penetration points. We measure the obstacle and smoothness cost of the resulting trajectories. For the smoothness cost, we use the norm of the waypoint parametrization, as opposed to the norm in the RKHS, as the common metric, to avoid favoring our new methods.

[Figure 4.7: 3-DOF robot. Start and end configurations of the 3-link arm in red, intermediate configurations in grey; end-effector in orange for the Gaussian RBF kernel (top-left), brown for the B-splines kernel (top-right), green for the Laplacian RBF kernel (bottom-left), and blue for waypoints (bottom-right). Trajectories after 10 iterations. Isocontours of the obstacle cost field shaded in grey (darker for higher cost).]

We use 100 different random obstacle placements and keep the start and goal configurations fixed. We compare the effectiveness of obstacle avoidance over 10 iterations, in 100 trials, with 12 randomly placed obstacles in a 2D environment (see Figure 4.7). The trajectory is represented with 4 maximum-violation points over time and robot body points at each iteration. The RKHS parametrization results in comparable obstacle cost and lower smoothness cost for the same number of iterations. We performed a t-test using the last-iteration samples, and found that the Gaussian RBF RKHS representation resulted in significantly lower obstacle cost (t(99) = −2.63, p < .01) and smoothness cost (t(99) = −3.53, p < .001). We expect this because, with the Gaussian RBF parametrization, the algorithm can take larger steps without breaking smoothness (see Section 4.7.2). We observe that waypoints and Laplacian RBF kernels (with large widths) behave similarly, while Gaussian RBF and B-spline kernels provide a smooth parametrization that allows the algorithm to take larger steps at each iteration. These kernels provide the additional benefit of controlling the motion amplitude, which is an important consideration for an adaptive motion planner. Figure 4.8 shows the obstacle and smoothness costs over the 10 iterations and 100 trials.

4.7.2 RKHSs Allow Larger Steps than Waypoints

One practical advantage of using an RKHS instead of the waypoint parametrization is the ability to take large steps during the optimization. Figure 4.9 compares the two approaches when taking large steps:

[Figure 4.8: (a) Obstacle cost and (b) smoothness cost vs. iterations for different kernels (waypoints, B-splines, Laplacian RBF, Gaussian RBF), for a 3-DoF robot in 2D. Error bars show the standard error over 100 samples.]

[Figure 4.9: 1-DOF 2D trajectory in a maze environment (obstacles shaded in grey). Top: Gaussian RBF, large steps (5 it.); middle: waypoints, large steps (5 it.); bottom: waypoints, small steps (25 it.)]

it takes 5 Gaussian RBF iterations to solve the problem, but it would take 28 iterations with smaller steps for the waypoint parametrization; otherwise, large steps cause oscillation and break smoothness. The resulting obstacle cost is always lower with Gaussian RBFs (t(99) = −5.32, p < .0001). The smoothness cost is also lower (t(99) = −8.86, p < .0001), as we saw in the previous experiment. Qualitatively, as seen in Section 4.4, the Gaussian RBF trajectories appear smoother: even after just one iteration, they do not break differential continuity. We represented the discretized trajectory with 100 waypoints, but used only 5 kernel evaluation point sets for the RKHS. We also tested the waypoint parametrization with only 5 waypoints, in order to evaluate an equivalent low-dimensional representation, but this resulted in a much poorer trajectory with regard to smoothness.

4.7.3 Experiments on a 7-DOF Manipulator

This section describes a comparison of the waypoint parametrization (CHOMP) and the RKHS Gaussian RBF (GRBF) on a simple 7-DOF manipulation task (Figure 4.10). We qualitatively optimized the obstacle cost weight λ and smoothness weight β for all methods after 25 iterations (including the waypoint method). The kernel width is kept constant over all 7-DOF experiments (σ = 0.9, same as in the previous experiments). Figures 4.10a and 4.10b illustrate both methods after 10 and 25 iterations, respectively. Figure 4.10a shows the end-effector traces after 10 iterations. The path for CHOMP (blue) is very non-smooth and collides with the cabinet, while the Gaussian RBF optimization finds a smoother path (orange) that is not in collision. Note that we use only a single max point for the RKHS version, which leads to much less computation per iteration compared to CHOMP. Figure 4.10b shows the results from both methods after 25 iterations of optimization. CHOMP is now able to find a collision-free path, but the path is still not very smooth compared to the RKHS-optimized path. These results echo our findings from the robot simulation and planar arm experiments.

[Figure 4.10 (a): Gaussian RBF (orange) with 1 max point (10 it., λ=20, β=0.5) vs. waypoints (blue, 10 it., λ=200). (b): the same comparison after 25 iterations (λ=20, β=0.5 vs. λ=200).]

4.7.4 Optimization under Different RKHS Norms

In this section we experiment with different RKHS norms, each expressing a distinct notion of trajectory efficiency or smoothness. Here we optimized the obstacle cost weight (λ) after 5 iterations. Figure 4.10c shows three different Gaussian RBF kernel-based trajectories. We show the end-effector traces after 50 iterations for a Gaussian RBF kernel with independent joints (Eq. 4.1, red), and for a Gaussian RBF derivative kernel with independent joints (yellow).

We also consider interactions among joints (orange), with a vector-valued RKHS whose kernel is composed of a 1-dimensional time kernel combined with a D × D kernel matrix that models interactions across joints. We build this matrix from the robot Jacobian at the start configuration (see Eq. 4.7). The Gaussian derivative kernel takes the form of a sum of a Gaussian RBF kernel and its first-order derivative. This kernel can be interpreted as an infinitely differentiable function with increased emphasis on its first derivative (it penalizes trajectory velocity more). We observe that the derivative kernel achieves a smoother path when compared with the ordinary Gaussian RBF RKHS. The Gaussian RBF kernel with joint interactions also yields trajectories that are smoother than the independent Gaussian RBF path.

[Figure 4.10: 7-DOF experiment, plotting the end-effector position from start to goal. (c) Coupled vs. independent RKHSs, 50 it.: GRBF-JJ with joint interactions (orange) and GRBF-der derivative (yellow) (1 max point, λ=8, β=1.0); GRBF with independent joints (red) (1 max point, λ=16, β=1.0). (d) RKHS vs. waypoints, 50 it.: GRBF-JJ (orange), GRBF-der (yellow) (1 max point, λ=8, β=1.0), waypoints (blue, λ=40).]

Figure 4.10d shows the end-effector traces of the GRBF derivative and the joint-interaction kernel, together with a CHOMP trajectory with waypoints (blue). This result shows that Gaussian RBF RKHSs

(GRBF-JJ) achieve a smoother, low-cost trajectory faster. Figure 4.11 shows a time comparison between the Gaussian RBF with joint interactions and the CHOMP (waypoints) method: the GRBF-JJ method is less time-consuming per iteration than CHOMP (t(49), p < .001) over 10 iterations.

[Figure 4.11: Average time per iteration over 50 iterations, for the GRBF RKHS with 1 max point (λ = 8, β = 1) vs. waypoints (λ = 40).]

4.7.5 7-DOF Experiments in a Cluttered Environment

Next, we test our method in more cluttered scenarios. We create random environments populated with objects of different shapes to increase planning complexity (see Figure 4.12a). We place 8 objects (2 boxes, 3 bottles, 1 kettle, 2 cylinders) at random x, y, z positions in the area above the kitchen table. We plan with random collision-free initial and final configurations. The RKHS trajectory with joint interactions (orange) achieves a smoother trajectory than the waypoint representation (blue); we perform a Wilcoxon t-test (t(40) = −4.4, p < .001) and obtain comparable obstacle costs. The method performs updates more conservatively and ensures very smooth trajectories. The other two RKHS variants consider independent joints: GRBF (red) and GRBF derivative (yellow). Both GRBF and GRBF derivative kernels achieve lower-cost trajectories than the waypoint parametrization (t(40) = −3.7, p < .001; t(40) = −4.04, p < .001), and converge in fewer iterations (approx. 12 it.). The smoothness costs are not statistically significantly different from the waypoint representation after 20 iterations. We show an example scenario with three variants of RKHSs (GRBF-JJ orange (λ = 75), GRBF yellow (λ = 45), waypoints blue (λ = 100)) after 6 iterations (Figure 4.12a) and after 20 iterations (Figure 4.12b).

[Figure 4.12: End-effector traces from start to goal in a cluttered environment after (a) 6 and (b) 20 iterations. Waypoints (λ=40) in blue, GRBF (1 max point, λ=45, β=1.0) in orange, GRBF-der (1 max point, λ=25, β=1.0) in yellow.]

Figure 4.13 shows the obstacle and smoothness costs per iteration.

[Figure 4.13: Obstacle and smoothness costs in the cluttered environment for GRBF-JJ (GRBF with joint interactions, orange), GRBF-der (GRBF derivative with independent joints, yellow), GRBF (independent joints, red), and waypoints (blue), over 20 iterations.]

We compare all results according to the CHOMP cost functions. As above, smoothness is measured in terms of total velocity (the waypoint metric), and obstacle cost is given by the distance to obstacles along the full trajectory (not only the max points). The smoothness and obstacle weights are kept fixed in all scenarios (β = 1.0).

4.7.6 7-DOF Experiments in a Constrained Environment

Next, we measured performance on a more constrained task. We placed the robot closer to the kitchen counter, and planned to a fixed goal configuration inside the microwave. We ran over 40 different random initial configurations. Figure 4.14 shows an example of the Gaussian RBF with joint interactions (GRBF-JJ, orange), with independent joints (GRBF-der, yellow), and waypoints (blue), after 30 iterations; we observe the end-effector traces after 30 iterations. We optimized the model parameters for this scenario (β = 1.0; GRBF (λ = 45), GRBF-JJ (λ = 40), GRBF-der (λ = 45), waypoints (λ = 100)). In Section 4.7.8 we compare against a non-optimized model. We report smoothness and obstacle costs per iteration (see Figure 4.15). In more constrained scenarios, the RKHS variants achieve smoother costs than the waypoint representation (p < .01; GRBF t(40) = −3.04, GRBF-JJ t(40) = −3.12, GRBF-der t(40) = −3.8), for approximately equivalent obstacle cost after 12 iterations. The joint-interaction updates are smooth but take longer to converge (12 it.), while the GRBF kernels with independent joints converged to a collision-free trajectory in approximately the same number of iterations as the waypoint experiments (4 it.).

[Figure 4.14: 7-DOF robot experiment, plotting the end-effector position from start to goal after 30 iterations. Waypoints (λ=40) in blue, Gaussian RBF (1 max point, λ=45, β=1.0) in orange, Gaussian RBF-JJ (1 max point, λ=45, β=1.0) in yellow.]

[Figure 4.15: Obstacle and smoothness costs in the constrained environment after 30 iterations, for the scenario of Figure 4.14.]

4.7.7 RBF norm vs. Waypoint parametrization

In this section, we study the effect of using an RKHS norm vs. the CHOMP norm. We perform an additional experiment, where we compare against an RKHS norm with waypoints. The trajectory becomes a weighted combination of kernel functions, ξ(t) = Σ_i w_i k(t_i, ·) for all t_i ∈ T, evaluated at equally spaced time points (waypoints). We performed qualitative experiments in a 2D setting with a 3-DOF planar arm. Figure 4.16 shows the end-effector traces (coloured line) after 10 iterations. We kept the RBF RKHS parameters fixed and optimized the obstacle weight, in order to achieve similar convergence speeds. We compare the waypoint CHOMP trajectory (blue) with the same number of waypoints (20) against only 4 max-cost time points for the GRBF RKHS (brown). The GRBF RKHS with equally spaced points (red) behaves smoothly, similarly to the GRBF RKHS norm with max points. This qualitative example suggests that optimizing with the RKHS norm yields smoother trajectories, independently of using waypoints or max-cost time points.

4.7.8 Parameter selection sensitivity

Here, we show the results for the 7-DOF simulation in constrained environments without hyper-parameter optimization of the model (obstacle cost weight) (§ 4.7.6). We use the same model parameters as in the high-clutter experiments (§ 4.7.5), without optimizing for this particular task. We measure performance over 40 different random initial configurations.

[Figure 4.16: 3-DoF robot in 2D; trajectory profiles using different kernels. Obstacle cost field in gray, for the same environment and the same initial and final configurations (axes fixed). Top: waypoints (blue); middle: Gaussian RBF with waypoints (red); bottom: Gaussian RBF (brown).]

[Figure 4.17: 7-DOF robot experiment, plotting the end-effector position from start to goal. Waypoints (λ=40) in blue, Gaussian RBF (1 max point, λ=45, β=1.0) in orange, Gaussian RBF-JJ (1 max point, λ=45, β=1.0) in yellow.]

Figure 4.17 shows an example of the Gaussian RBF with joint interactions in yellow (GRBF-JJ), with independent joints in orange (GRBF), and waypoints in blue, after 30 iterations. We show smoothness and obstacle costs per iteration in Figure 4.18. Without cost-weight tuning the RKHSs perform worse, but still achieve low-cost and smooth solutions. The updates with joint interactions are smoother but take longer to converge, while the GRBF kernels with independent joints converged in approximately the same number of iterations as the waypoint experiments (12 it.). In Figure 4.17, we observe the end-effector traces after 30 iterations. The joint-interaction kernel is smoother compared with the waypoint parametrization; however, the full trajectory smoothness is not statistically different when measured in the waypoint smoothness metric. We observe that optimizing the model parameters has a significant impact on the overall performance of the method (see Section 4.7.6). An interesting research extension could leverage this variability to produce planners with multi-resolution capabilities according to the surrounding environment (large motions in uncluttered scenes, with high obstacle cost

Figure 4.18: Obstacle and Smooth- and high kernel widths vs. localized changes lower obstacle cost and ness cost of random trajectories in con- strained environment after 30 iterations. with low kernel widths).

4.8 Conclusions

In this chapter, we presented a kernel approach to robot trajectory optimization: we represented smooth trajectories as vector-valued functions in an RKHS. Different kernels lead to different notions of smoothness, including commonly used variants as special cases (velocity and acceleration penalization). We introduced a novel functional gradient trajectory optimization method based on RKHS representations, and demonstrated its efficiency compared with another optimization algorithm (CHOMP). This method benefits from a low-dimensional trajectory parametrization that is fast to compute. Furthermore, RKHSs enable us to plan with kernels learned from user demonstrations, leading to spaces in which more predictable motions have lower norm, ultimately fostering better human-robot interaction (Dragan et al., 2014). This work is an important step in exploring RKHSs for motion planning. It describes trajectory optimization under the light of reproducing kernels, which we hope can leverage the knowledge used in kernel-based machine learning to develop better motion planning methods.

5 Planning with Method of Moments

In this chapter, we combine predictive models with algorithms for planning under uncertainty. We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent architecture that brings insights from predictive state representations to reinforcement learning in partially observable environments. Predictive state policy networks consist of a recursive filter, which keeps track of a belief about the state of the environment, and a reactive policy that directly maps beliefs to actions so as to maximize the cumulative reward. We discuss predictive state representations in Section 5.2, the proposed policy in Section 5.3, and the predictive state model component in Section 5.4. The control component is presented in Section 5.5 and the learning algorithm in Section 5.6. In Section 5.8, we describe the experimental setup and results on control tasks: we demonstrate the performance of reinforcement learning using predictive state policy networks in multiple partially observable environments with continuous observations and actions.

5.1 Motivation

In this chapter, we put forth predictive state representations for learning to control dynamical systems in an interactive manner via reinforcement learning (RL). We consider an agent (which could be a robot or a state machine) that is able to observe and act upon an environment, while receiving feedback/reward in a discrete-time controlled process. We fully characterize this process via the state q ∈ Q of the environment. This quantity provides a sufficient statistic for predicting future observations p(o_1, ..., o_n | a_1, ..., a_n) conditioned on executed actions, hence correct estimation of the state is crucial to model the system. Given this interaction, the agent aims to learn a function π : Q → A, which we denote as a policy, that generates actions a ∈ A from the state of the system q. The optimal policy is the one that provides maximum reward r_t ∈ R over a sequence of interactions with the

environment, which we call episodes. We assume the agent interacts with the environment in episodes, where each episode consists of T time steps, in each of which the agent takes an action a_t ∈ A and receives an observation o_t ∈ O and a reward r_t ∈ R. The agent chooses actions based on a stochastic policy π_θ parameterized by a parameter vector θ:

$$\pi_\theta(a_t \mid o_{1:t-1}, a_{1:t-1}) \equiv p(a_t \mid o_{1:t-1}, a_{1:t-1}, \theta). \quad (5.1)$$

We would like to improve the policy by optimizing θ based on the agent's experience, in order to maximize the expected long-term reward

$$J(\pi_\theta) = \mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T} \gamma^{t-1} r_t \,\middle|\, \pi_\theta\right], \quad (5.2)$$

where γ ∈ [0, 1] is a discount factor. We can express many Robotics tasks in the above framework; see Section 2.7. These problems are, in most cases, high dimensional and continuous. For instance, in robot motion prediction we need to learn continuous actions, in the form of twists and torques of the robot's joints. Observations are typically continuous and high dimensional, corresponding to the robot's joint angles and velocities. On the other hand, in most NLP tasks it is sufficient to consider a finite set of possible observations and actions for input/output finite state machines, where the tabular setting can still be sustained (Luque et al., 2012).
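As a small illustration of the objective in Eq. 5.2, the following sketch computes the length-normalized discounted return of a single episode; the reward values are made up for the example.

```python
# A small sketch of the per-episode quantity inside Eq. 5.2:
# (1/T) * sum_{t=1..T} gamma^(t-1) r_t. Rewards below are illustrative.
import numpy as np

def discounted_return(rewards, gamma=0.99):
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ rewards) / len(rewards)

print(discounted_return([0.0, 0.1, 0.5, 1.0]))  # one short episode
```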

The standard approach to modeling controlled processes involves estimating transition and observation probabilities in a predetermined state space. This process is modeled formally via a Markov decision process (MDP), where the state of the system is fully known (Sutton et al., 1998; Gordon, 1999). In practice, and especially in Robotics domains, it is unrealistic to assume the true state is fully observable and noise-free; see Section 2.6 for a detailed discussion. In fact, observations are often aliased, noisy and insufficient to estimate the state of the system. To accommodate this uncertainty, many robotic learning problems are modeled with partial observability. Formally, MDPs are extended to partially observable Markov decision processes (POMDPs), where the state is unknown and instead a belief over states is maintained, i.e., a probability distribution over the state space (Kaelbling et al., 1998). This formalism already accounts for some modeling uncertainty, but it still requires prior knowledge of an explicit state representation and known transition and reward functions. In many Robotics settings it would be hard to provide this knowledge a priori.

Next, we describe two major lines of research that learn a policy by interacting with the environment. The first is the model-based approach, where, ideally, we would like the agent to learn a model of the system dynamics and also learn to optimize a policy; this setting establishes a form of reinforcement learning in which experience is gathered via the learned model, in the form of simulated episodes (observation, action and reward sequences). Conversely, the second approach takes a model-free perspective, where experience is gathered directly by interacting with the environment and the policy is learned from this experience without building a transition model first.

Model-based learning may help reduce expert bias when some of the model quantities, such as the transition dynamics and reward function, are pre-specified. This type of learning attempts to mitigate the sample-efficiency problems inherently associated with model-free approaches when interaction with the environment is expensive. In Robotics, reducing the learning time on the real environment is highly desirable, since real experience may damage the robot and is often slower than learning via simulated episodes. In this sense, model-based methods have the advantage of requiring fewer samples and generalizing efficiently to unseen situations (Atkeson et al., 1997). The learned policy is, however, strongly dependent on the quality of the model, which can become a bottleneck if the algorithm is not robust to model errors. Recently, there has been significant progress in model-free reinforcement learning combined with deep neural networks to learn a policy (Bojarski et al., 2016; Schulman et al., 2015). Deep reinforcement learning combines deep networks, as a high-level representation of a policy, with reinforcement learning as an optimization method, and allows for end-to-end training. In this thesis, we attempt to combine both approaches, taking advantage of learned models to improve sample efficiency while exploiting recent advances in model-free methods using deep neural networks. Traditional applications of deep learning rely on standard architectures with sigmoid activations or rectified linear units; there is, however, an emerging trend of using composite architectures that contain parts inspired by other algorithms, such as Kalman filtering (Haarnoja et al., 2016) and value iteration (Tamar et al., 2016). This composite structure brings forth benefits from other modeling tools, enhancing the performance of standard neural networks.

In this work, we focus on the more realistic scenario of partially observable environments, in which the agent is uncertain about the state of the environment and has to keep track of a distribution over states, i.e., a belief state, based on the entire history of observations and actions already executed. The typical approach in the neural network community to model partial observations relies on recurrent architectures such as Long Short-Term Memory (LSTM) (Hochreiter et al., 1997) and Gated Recurrent Units (GRU) (Cho et al., 2014b).

However, these recurrent models are difficult to train due to non-convexity, and their hidden states lack a statistical meaning, making them hard to interpret. Predictive state representations (PSRs) (§ 2.5.4) offer an alternative class of models, whose predictive state may serve as a surrogate for a belief over states, with the additional advantage of being interpretable as a sufficient statistic of future observations, conditioned on history and future actions (Littman et al., 2001; Singh et al., 2004a; Rosencrantz et al., 2004b; Boots et al., 2013b). Predictive representations do not require a prior state-space definition, since they emerge directly from observable quantities. This fact, in combination with efficient learning algorithms with theoretical guarantees, makes predictive models an ideal candidate for combining learning and planning.

5.2 Predictive State Representations

Predictive State Representations (PSRs) constitute an expressive form of modeling dynamical systems (Jaeger, 1999; Littman et al., 2001). They can be applied to both uncontrolled and controlled processes, so they can be used as a modeling tool in reinforcement learning. PSRs form a class of sequential models that track the state of the dynamical system via a predictive state q ∈ Q. The predictive state can be described directly in terms of observable quantities, as an expectation over sufficient statistics of future observations. This forms a powerful class of models that subsumes POMDPs (Littman et al., 2002). Instead of keeping a belief over the state space, PSRs characterize the stochastic process by keeping an implicit state representation that is constructed using the least possible information about the domain (Singh et al., 2004b). The state of the dynamical system can be described in terms of the conditional success probabilities¹ of all tests t ∈ Σ∗, where t_O = o_{t:t+k−1} = [o_t, o_{t+1}, ..., o_{t+k−1}] and t_A = a_{t:t+k−1} = [a_t, a_{t+1}, ..., a_{t+k−1}], for all o ∈ Σ, a ∈ A, conditioned on all histories h = o_{t−1:1} = [o_{t−1}, a_{t−1}, ..., o_1, a_1], as seen in Section 2.5.4:²

$$p(t_O \mid h; t_A) = p(o_{t:t+k-1} \mid h; a_{t:t+k-1}). \quad (5.3)$$

¹ We say a test is successful if, after executing the sequence of actions, the sequence of observations received by the agent coincides with the one specified by the test.
² For clarity, we use ";" to denote that the agent will execute the following sequence of actions from its policy.

Littman et al. (2002) show that it suffices to maintain a distribution over a potentially small set of core tests, instead of all possible tests, to define a basis for representing the predictive state. All other tests can then be described as linear combinations of core tests. The minimal number of core tests defines the complexity of the PSR, closely related to the rank of the system dynamics matrix (Section 2.5.4). There are systems of rank K that cannot be modeled by a POMDP with any finite number of states, yet every system of rank K can be modeled by a PSR with exactly K core tests. Identifying the minimal set of core tests may prove to be a difficult task (the discovery problem); however, identifying a subspace that spans the space of all core tests is an easier and more efficient task, leading to transformed PSRs (Rosencrantz et al., 2004a). Rosencrantz et al. (2004a), Hsu et al. (2009), and Boots et al. (2011a) derived consistent learning methods for subspace identification of transformed PSRs, based on spectral decomposition of moment statistics; they were able to retrieve model parameters from observable moments that corresponded to those obtained from the true model. Further work by Boots et al. (2013b), Song et al. (2009), and Fukumizu et al. (2013b) extended the moment-based approach to kernel embeddings of distributions, extending the representation of observations and actions to the continuous domain. This is useful in many applications where the domain is structured or high dimensional, potentially continuous, such as many problems in Robotics.

5.3 Predictive Reinforcement Learning

In this section, we propose the Recurrent Predictive State Policy (RPSP) network, a recurrent architecture that consists of a predictive state model acting as a recursive filter and a feed-forward neural network that directly maps predictive states to actions. The filter component represents the state as an expectation of sufficient statistics of future observations, conditioned on history and future actions. Moreover, the successive application of the predictive state update procedure (i.e., the filtering equations) results in a recursive computation graph that is fully differentiable with respect to the model parameters, and hence amenable to gradient descent refinement. Therefore, we can treat predictive state models as recurrent networks and apply backpropagation through time (BPTT) (Hefny et al., 2017b; Downey et al., 2017) to optimize the model parameters. The proposed configuration results in a recurrent policy, where the recurrent part is implemented by a PSR instead of an LSTM or a GRU. As predictive states are a sufficient summary of the history of observations and actions, the reactive policy has rich enough information to make its decisions, as if it had access to a true belief state. RPSPs leverage an efficient and statistically consistent initialization by using PSRs as a filtering tool: spectral learning of PSRs provides a theoretically grounded initialization of a core component of the policy. Additionally, they benefit from the statistical interpretation of predictive states, and can be evaluated and optimized based on that interpretation. The RPSP network can be trained end-to-end, for example using policy gradients in a reinforcement learning setting (Sutton et al., 2001) or supervised learning in an imitation learning setting (Ross et al., 2011). In this work, we focus on the former.

Throughout the rest of the chapter, we use ⊗ to denote the vectorized outer product: x ⊗ y is xy⊤ reshaped into a vector. There are two major approaches to policy modeling and optimization. The first is the value-function-based approach, where we seek to learn a function, typically a deep network, that evaluates the value of each action at each state (a.k.a. the Q-value) under the optimal policy. Given the Q-function, the agent can act greedily based on the estimated values. The second approach is policy optimization, where we learn a function that directly predicts optimal actions (or optimal action distributions) without pre-computing the corresponding values. This function, or policy, is directly optimized to maximize J(θ) using policy gradient methods (Schulman et al., 2015; Duan et al., 2016) or derivative-free methods (Szita et al., 2006). The two approaches are related, since both are amenable to variance reduction methods and compatible with function approximation. However, we focus on the second approach, as it is more robust to noisy continuous environments and modeling uncertainty (Sutton et al., 2001; Wierstra et al., 2010).

Our aim is to provide a new class of policy functions that combines recurrent reinforcement learning with recent advances in modeling partially observable environments using predictive state representations (PSRs). There have been previous attempts to combine predictive state models with policy learning. Boots et al. (2011b) proposed a method for planning in partially observable environments. The method first learns a PSR from a set of trajectories collected using an explorative blind policy. The predictive states estimated by the PSR are then treated as states in a fully observable Markov decision process, and a value function is learned on these states using least-squares temporal difference learning (Boots et al., 2010) or point-based value iteration (PBVI) (Boots et al., 2011b). The main disadvantage of these approaches is that they assume a one-time initialization of the PSR and do not propose a mechanism to update the model based on subsequent experience; as a result, they behave poorly when the data is insufficient (the initial PSR is limited). Moreover, it has been shown that PSRs can benefit greatly from local optimization after a moment-based initialization (Downey et al., 2017; Hefny et al., 2017b).

Hamilton et al. (2014) proposed an iterative method to simultaneously learn a PSR and use the predictive states to fit a Q-function. Azizzadenesheli et al. (2016) proposed a tensor decomposition method to estimate the parameters of a discrete partially observable Markov decision process (POMDP) and used concentration inequalities to choose actions that maximize an upper confidence bound on the reward. This method does not employ a PSR, but it uses a consistent moment-based method to estimate model parameters. One common limitation of the aforementioned methods is that they are restricted to discrete

actions (some even assume discrete observations).

In this section, we introduce RPSP networks, a combination of a predictive state filter and a reactive policy. This class of policies admits a statistically efficient initialization method, in contrast to typical RNNs (described in Section 2.5.6), and it also admits iterative updates using policy gradient methods. We propose a novel combined update that exploits the interpretation of PSR states and optimizes both the accumulated reward and the predictive accuracy of the PSR. We further make use of RPSPs to learn a policy in partially observable environments with continuous actions and continuous observations.

5.4 Predictive State Representations of Controlled Models

In this section, we revisit predictive state representations (§ 2.5.4), which constitute the state tracking (filtering) component of our model, and we discuss their relationship to recurrent neural networks (RNNs). We follow the predictive state controlled model formulation provided by Hefny et al. (2017b); however, we could start from alternative predictive models such as predictive state inference machines (Sun et al., 2016a).

RNNs typically consider a history of sequence pairs of observations

and actions a_1, o_1, a_2, o_2, ..., a_{t−1}, o_{t−1} to compute a hidden state q_t using a recursive update equation q_{t+1} = f(q_t, a_t, o_t). This recursive update is often non-linear and does not contemplate observation prediction. We could, however, consider a case where we learn to predict observations through an additional function g(q_t, a_t) ≡ E[o_t | q_t, a_t]. Because q_t is latent, the function g that connects states to the output is unknown and has to be learned separately. In this case, the output could be the predicted observations when the RNN is used for prediction; see Figure 5.1 (top).

Figure 5.1: a) Computational graph of an RNN and a PSR, and the details of the state update function f for b) a simple RNN and c) a PSR. Compared to an RNN, the observation function g is easier to learn in a PSR (see §5.4).

Predictive state models define a similar recursive state update. However, the state q_t has a specific interpretation: it corresponds to a predictive state, meaning it encodes a conditional distribution of future observations o_{t:t+k−1} conditioned on future actions a_{t:t+k−1} (for example, in the discrete case, q_t could be a vectorized probability table as in Eq. 5.3).³

³ The length k depends on the observability of the system. We say a system is k-observable if maintaining the predictive state is equivalent to maintaining the distribution of the latent state of the system.

The main characteristic of a predictive state is that it is defined entirely in terms of observable quantities. Depending on the choice of features used to compute the conditional expectation, we obtain different mappings from the predictive state q_t to the prediction of o_t given a_t. It is therefore crucial to consider expressive-enough feature mappings that allow for a compact, yet accurate, representation of the predictions. We consider random feature mappings (Rahimi et al., 2008) with kernel representations to learn this mapping consistently.

This is in contrast to a general RNN, where this mapping is unknown and typically requires non-convex optimization to be learned.⁴ This characteristic allows for efficient and consistent learning of models with predictive states by reduction to supervised learning (Hefny et al., 2015b; Sun et al., 2016b).

⁴ We do not require prediction error minimization to take advantage of the PSR initialization. We can still consider a PSR filter without optimizing for prediction error, as seen later in the experimental section.

Similar to an RNN, a PSR employs a recursive state update that consists of two steps (see Section 2.5.6 for more details): a state extension and a conditioning step.

The state extension defines an extended state p_t = E[ξ^o(o_{t:t+k}) | h_t; ξ^a(a_{t:t+k})] that is obtained from a linear transformation W_ext of the predictive state q_t = E[φ^o(o_{t:t+k−1}) | h_t; φ^a(a_{t:t+k−1})]. The extended state defines a conditional distribution over an extended window of k + 1 observations and actions (see Figure 5.2), and the linear map W_ext is an additional parameter of the model:

$$p_t = W_{\text{ext}}\, q_t \quad \text{(Eq. 2.75)}$$

The conditioning step establishes the filter update equation: taking into account the current a_t and o_t, a known conditioning function f_cond produces the consecutive predictive state

$$q_{t+1} = f_{\text{cond}}(p_t, a_t, o_t). \quad \text{(Eq. 2.76)}$$

Figure 5.2 depicts the two steps. The conditioning function f_cond depends on the representation of q_t and p_t. For example, in a discrete system, q_t and p_t could represent conditional probability tables, and f_cond amounts to applying Bayes' rule. In the continuous setting, using Hilbert space embeddings of distributions (Boots et al., 2013b), the conditioning function f_cond is given by the kernel Bayes' rule (Fukumizu et al., 2013b) (a non-linear filter, in Eq. 2.77). In this work, we use random Fourier features (RFFs) (Rahimi et al., 2008) to represent the feature mapping of a Hilbert space embedding of a PSR (RFFPSR), as in Section 2.5.6. Observation and action features are based on RFFs of the RBF kernel, projected into a lower-dimensional subspace using randomized PCA (Halko et al., 2011). We use φ to denote this feature function. The observation function g in a PSR is a linear function of the state, E[o_t | q_t, a_t] = W_pred (q_t ⊗ φ(a_t)); see the PSR state update scheme in Figure 5.3.

Figure 5.2: Illustration of the PSR extension and conditioning steps.
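The two-step update just described can be summarized structurally as below; W_ext, W_pred, the feature map φ and the conditioning function f_cond are stand-ins for the learned quantities (f_cond would be Bayes' rule in the discrete case and kernel Bayes' rule for Hilbert space embeddings), and all dimensions are illustrative.

```python
# A structural sketch of one PSR filter step: extend, predict, condition.
# All quantities here are illustrative stand-ins, not learned parameters.
import numpy as np

def psr_step(q, a, o, W_ext, W_pred, f_cond, phi):
    p = W_ext @ q                          # extended state (Eq. 2.75)
    o_hat = W_pred @ np.kron(q, phi(a))    # predicted observation E[o | q, a]
    q_next = f_cond(p, a, o)               # conditioning on (a_t, o_t) (Eq. 2.76)
    return q_next, o_hat

rng = np.random.default_rng(0)
W_ext, W_pred = rng.normal(size=(6, 4)), rng.normal(size=(2, 8))
phi = lambda a: a                                                 # identity action features
f_cond = lambda p, a, o: p[:4] / (np.linalg.norm(p[:4]) + 1e-8)   # toy conditioning
q, o_hat = psr_step(np.ones(4) / 2.0, rng.normal(size=2),
                    rng.normal(size=2), W_ext, W_pred, f_cond, phi)
```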

5.4.1 Learning predictive state representations

Learning PSRs is carried out in two steps: an initialization procedure using the method of moments (Hefny et al., 2015b) in Section 2.5.6 and a local optimization procedure using gradient descent (Hefny et al., 2017b).

Initialization:

Here we follow the two-stage regression proposed by Hefny et al. (2015b).⁵ The initialization procedure exploits the fact that q_t and p_t are represented in terms of observable quantities: since W_ext is linear, Eq. 2.75 gives E[p_t | h_t] = W_ext E[q_t | h_t]. Here h_t ≡ h(a_{1:t−1}, o_{1:t−1}) denotes a set of features extracted from previous observations and actions (typically from a fixed-length window ending at t − 1). Because q_t and p_t are defined in terms of observable quantities rather than hidden states, estimating the expectations on both sides can be done by solving supervised regression subproblems. Given the predictions from this regression, solving for W_ext becomes another linear regression problem. Once W_ext is computed, we can perform filtering to obtain the predictive states q_t. We then use the estimated states to learn the mapping to predicted observations W_pred, which results in another regression subproblem. In the random Fourier PSR (RFFPSR), we use linear regression for all subproblems (a reasonable choice with kernel-based features). This ensures that the two-stage regression procedure is free of local optima.

⁵ Specifically, we use the joint stage-1 regression variant for initialization.
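A minimal sketch of this procedure, assuming feature matrices H (history), Q (future) and P (extended future) have already been extracted from exploration trajectories and using an illustrative ridge regularizer, could look as follows.

```python
# A minimal sketch of two-stage regression; H, Q, P are assumed precomputed
# feature matrices (one row per time step), and lam is an illustrative choice.
import numpy as np

def ridge(X, Y, lam=1e-3):
    """Solve min_W ||X W - Y||^2 + lam ||W||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def two_stage_regression(H, Q, P, lam=1e-3):
    # Stage 1: denoise states by regressing them on history features.
    W_q = ridge(H, Q, lam)            # E[q_t | h_t] ~ h_t W_q
    W_p = ridge(H, P, lam)            # E[p_t | h_t] ~ h_t W_p
    # Stage 2: relate the denoised estimates, E[p | h] = W_ext E[q | h].
    Q_hat, P_hat = H @ W_q, H @ W_p
    return ridge(Q_hat, P_hat, lam)   # row convention: p_t ~ W_ext^T q_t

rng = np.random.default_rng(1)
H = rng.normal(size=(500, 5))
Q = H @ rng.normal(size=(5, 4)) + 0.1 * rng.normal(size=(500, 4))
P = Q @ rng.normal(size=(4, 6)) + 0.1 * rng.normal(size=(500, 6))
W_ext = two_stage_regression(H, Q, P)
```

Since every subproblem is a linear regression, the procedure has no local optima, which is the point made above.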

Local Optimization:

Although the PSR initialization procedure is consistent, it is based on the method of moments and hence is not necessarily statistically efficient, since it requires a large amount of data to build accurate moment estimates. Estimation of the filter parameters can therefore benefit from local optimization to reduce the mean squared prediction error. Similarly to RNNs, PSRs define a recursive computation graph, whose update translates into the RPSP filter:

$$q_{t+1} = f_{\text{cond}}(W_{\text{ext}}\, q_t, a_t, o_t) \quad (5.4)$$
$$\mathbb{E}[o_t \mid q_t, a_t] = W_{\text{pred}}(q_t \otimes \phi(a_t)). \quad (5.5)$$

With a differentiable f_cond, the PSR can be trained using backpropagation through time (BPTT) (Werbos, 1990) to minimize the prediction error. In this way, PSRs constitute a special recurrent structure whose state representation and update function are implicitly defined by the choice of kernel, leading to consistent forms of initialization, which can be further improved by BPTT.
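The following PyTorch sketch illustrates this recursive computation graph and one BPTT refinement step on the 1-step prediction error; the differentiable f_cond used here is a simple normalized linear stand-in (not the kernel Bayes' rule of the full model), and all dimensions and data are illustrative.

```python
# A sketch of refining PSR parameters by BPTT on the 1-step prediction error
# (Eqs. 5.4-5.5). f_cond below is a differentiable stand-in, not kernel
# Bayes' rule; dimensions, data and the initial state are illustrative.
import torch

d_q, d_p, d_o, d_a, T = 8, 12, 3, 2, 50
W_ext = torch.randn(d_p, d_q, requires_grad=True)
W_pred = torch.randn(d_o, d_q * d_a, requires_grad=True)
W_cond = torch.randn(d_q, d_p + d_o + d_a)   # fixed part of the stand-in f_cond

def f_cond(p, a, o):
    z = W_cond @ torch.cat([p, o, a])
    return z / (z.norm() + 1e-8)             # keep the state well scaled

obs, act = torch.randn(T, d_o), torch.randn(T, d_a)
q = torch.zeros(d_q); q[0] = 1.0             # illustrative initial state q_0

loss = torch.zeros(())
for t in range(T):
    o_hat = W_pred @ torch.kron(q, act[t])   # E[o_t | q_t, a_t] (Eq. 5.5)
    loss = loss + ((o_hat - obs[t]) ** 2).sum()
    q = f_cond(W_ext @ q, act[t], obs[t])    # filter update (Eq. 5.4)

loss.backward()                              # gradients w.r.t. W_ext, W_pred
```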

5.5 Recurrent Predictive State Policy (RPSP) Networks

We now introduce Recurrent Predictive State Policies (RPSPs). In this section, we formally describe their components, followed by a policy learning algorithm in §5.6. RPSPs consist of two fundamental components: a state tracking component, which models the state of the system and is able to predict future observations, followed by a reactive policy, which can be

a possibly non-linear transformation of predictive states into a distribution over actions, as shown in Figure 5.3. The integrated model seeks to improve a policy π_θ(a_t | a_1, o_1, ..., a_{t−1}, o_{t−1}) by letting an agent interact with its environment. RPSPs learn the distribution over actions from a set of trajectories, i.e., sequences of action-observation-reward triplets: at each time step t, we get an observation o_t and an immediate reward r_t after executing a_t.

This policy is parametrized by θ = {θ_PSR, θ_re}, corresponding to the filter and reactive components respectively. The predictive model parameters θ_PSR = {q_0, W_ext, W_pred} correspond respectively to the initial predictive state estimate, the linear weights from predictive states to extended states W_ext, and the linear regression matrix that predicts consecutive observations, ô_{t+1} = W_pred q_t.

Figure 5.3: RPSP network: the predictive state is updated by a linear extension W_ext followed by non-linear conditioning f_cond. A linear predictor W_pred is used to predict observations, which regularizes the training loss (see §5.6). A feed-forward reactive policy maps the predictive states q_t to a distribution over actions.

For the reactive component, we consider a stochastic non-linear policy π_re(a_t | q_t) ≡ p(a_t | q_t; θ_re) parametrized by θ_re. Since we are dealing with continuous actions, we assume a Gaussian distribution N(μ_t, Σ) with parameters

$$\mu_t = \varphi(q_t; \theta_\mu); \qquad \sigma^2 = \exp(r), \quad (5.6)$$

where ϕ is a function (e.g., a feed-forward network) parametrized by θ_μ, and r is a learnable vector. In the following section, we describe how these parameters can be learned.
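A minimal sketch of this reactive component (Eq. 5.6), with an illustrative one-hidden-layer network standing in for ϕ, is given below; all sizes and weights are assumptions.

```python
# A minimal sketch of the Gaussian reactive policy of Eq. 5.6: a small
# feed-forward mean network plus a state-independent log-variance vector r.
import numpy as np

rng = np.random.default_rng(0)
d_q, d_h, d_a = 10, 16, 2
W1, b1 = 0.1 * rng.normal(size=(d_h, d_q)), np.zeros(d_h)
W2, b2 = 0.1 * rng.normal(size=(d_a, d_h)), np.zeros(d_a)
r = np.zeros(d_a)                       # learnable log-variance vector

def reactive_policy(q):
    h = np.maximum(0.0, W1 @ q + b1)    # ReLU hidden layer (phi in Eq. 5.6)
    mu = W2 @ h + b2                    # mu_t = phi(q_t; theta_mu)
    sigma = np.exp(0.5 * r)             # sigma^2 = exp(r)
    return rng.normal(mu, sigma)        # a_t ~ N(mu_t, diag(sigma^2))

a_t = reactive_policy(rng.normal(size=d_q))
```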

5.6 Learning Recurrent Predictive State Policies

RPSPs can make use of a principled initialization of the predictive layer, using moment-matching techniques (Hefny et al., 2015b), and update its parameters by refinement via gradient descent. In the initialization phase, we gather experience from an exploration policy, typically a blind Gaussian policy, and use it to initialize the PSR, as described in §5.4. It is worth noting that this initialization procedure depends solely on sequences of observations and actions and does not take into account the reward signal. This can be particularly useful in environments where informative reward signals are infrequent.

In the second phase, starting from an initialized PSR and an initial random reactive policy, we iteratively collect trajectories using the current policy and use them to update the RPSP parameters. We simultaneously update both the reactive policy parameters θ_re = {θ_μ, r} and the predictive model parameters θ_PSR = {q_0, W_ext, W_pred}, as detailed in Algorithm 8.

Algorithm 8: Recurrent Predictive State Policy network Optimization (RPSPO)
Input: Learning rate η.
1: Sample initial trajectories {(o_t^i, a_t^i)_t}_{i=1}^M from π_exp.
2: Initialize the PSR: θ_PSR^0 = {q_0, W_ext, W_pred} via two-stage regression (§5.4.1).
3: Initialize the reactive policy θ_re^0 randomly.
4: for n = 1 ... N_max iterations do
5:   for i = 1, ..., M (batch of M trajectories from π^{n−1}) do
6:     Reset the episode: a_0.
7:     for t = 0 ... T (roll-in in each trajectory) do
8:       Get observation o_t^i and reward r_t^i.
9:       Filter q_{t+1}^i = f_t(q_t^i, a_t^i, o_t^i) (Eq. 5.4).
10:      Execute a_{t+1}^i ∼ π_re^{n−1}(q_{t+1}^i).
11:   Update θ using D = {{o_t^i, a_t^i, r_t^i, q_t^i}_{t=1}^T}_{i=1}^M: θ^n ← UPDATE(θ^{n−1}, D, η), as in §5.6.
12: Return: θ = (θ_PSR, θ_re).

Let p(τ | θ) be the distribution over trajectories induced by the policy π_θ. The iterative parameter update seeks to locally minimize the following RPSP loss:

$$\mathcal{L}(\theta) = \alpha_1 \ell_1(\theta) + \alpha_2 \ell_2(\theta) = -\alpha_1 J(\pi_\theta) + \alpha_2 \sum_{t=0}^{T} \mathbb{E}_{p(\tau \mid \theta)}\left[ \| W_{\text{pred}}(q_t \otimes a_t) - o_t \|^2 \right], \quad (5.7)$$

which combines the negative expected return −J(π_θ) with the PSR prediction error. We minimize the 1-step prediction error instead of the general k-step future prediction error often used in PSRs (Hefny et al., 2017b; Boots et al., 2011a). This is an attempt to avoid biased estimates induced by non-causal statistical correlations between observations (o_{t+i}) and future actions (a_{t+i+1:t+k}) when performing on-policy updates with a non-blind policy. Previous work avoids this problem by updating the PSR parameters from a blind policy (Boots et al., 2011a; Hamilton et al., 2014).

An RPSP is essentially a special type of RNN, so any policy gradient algorithm could be used with an initialized RPSP. However, optimizing the PSR parameters to maintain a low prediction error can be interpreted as a form of regularizing the network, based on the statistical interpretation of the recursive filter. In this approach, we optimize the policy parameters θ by minimizing the joint loss function in Eq. 5.7, where α_1, α_2 ∈ R are hyper-parameters that determine the relative importance of the negative expected return ℓ_1 = −J(π) and the prediction cost ℓ_2, respectively;⁶ we describe how α_1 and α_2 can be set in §5.6.3.

⁶ We only require one parameter for the optimization, but later we rescale each term by its variance and denote this scaling parameter α.

Gradient-based policy search has been a successful approach in many applications of reinforcement learning (Peters et al., 2008; Deisenroth et al., 2013). Policy gradient methods provide a robust and powerful technique to estimate the optimal policy, even with noisy state information (Baxter et al., 2001). They can handle continuous actions and are compatible with policies that have internal

memory (Wierstra et al., 2010). These characteristics make them ideal candidates to combine with a predictive model. We update the policy parameters using batches of trajectories by following a descent direction of the negative expected return, θ_π^{n+1} = θ_π^n + α ∇_θ J(π), with a specified learning rate α ∈ R, until the process converges. RPSPs are a special type of recurrent network policy, hence it is possible to adapt existing policy gradient methods, such as REINFORCE (Williams, 1992), to the joint loss in (5.7). In the following subsections, we propose different update variants, a joint update in Section 5.6.1 and an alternating variant in Section 5.6.2, based on two policy gradient methods: a REINFORCE-style update and a natural gradient variant based on Trust Region Policy Optimization (TRPO) (Schulman et al., 2015).

5.6.1 Joint Variance Reduced Policy Gradient (VRPG)

In this variant, we follow the REINFORCE method (Williams, 1992) to obtain a stochastic gradient of J(π) (the first term in Eq. 5.7), using the likelihood ratio trick (Glynn, 1990; Aleksandrov et al., 1968). Let R(τ) = ∑_{t=1}^{T} γ^{t−1} r_t be the cumulative discounted reward of a trajectory τ, given a discount factor γ ∈ [0, 1], and let J(π) = E_{p(τ|θ)}[R(τ)] be the expected cumulative discounted reward. From the likelihood ratio trick, ∇_θ p(τ|θ) = p(τ|θ) ∇_θ log p(τ|θ), we can compute the REINFORCE gradient ∇_θ J(π) as

$$\nabla_\theta J(\pi) = \mathbb{E}_{\tau \sim p(\tau \mid \theta)}\Big[ R(\tau) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid q_t) \Big].$$

In practice, we use a variance-reducing variant of the policy gradient (Greensmith et al., 2001), given by

$$\nabla_\theta J(\pi) = \mathbb{E}_{\tau \sim p(\tau \mid \theta)} \sum_{t=0}^{T} \big[ \nabla_\theta \log \pi_\theta(a_t \mid q_t)\,(R_t(\tau) - b_t) \big], \quad (5.8)$$

where we replace the cumulative trajectory reward R(τ) by a reward-to-go function R_t(τ) = ∑_{j=t}^{T} γ^{j−t} r_j, the accumulated discounted future reward starting from t; we write R_t(τ^i) for the return at step t of trajectory i. The term b_t is a variance-reducing baseline. To further reduce variance, we use a baseline b_t ≡ E_θ[R_t(τ) | a_{1:t−1}, o_{1:t}], which estimates the expected reward-to-go conditioned on the current policy and depends only on the history and the current observation. In the proposed implementation, we assume b_t = w_b^⊤ q_t for a parameter vector w_b estimated by linear regression. Given a batch of M trajectories, a stochastic gradient of J(π) is obtained by replacing the expectation in (5.8) with the empirical expectation over the trajectories in the batch.
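A small sketch of this estimate, showing only the gradient with respect to the action means of a diagonal-Gaussian policy (trajectory arrays and baseline values are assumed given), could look as follows.

```python
# A sketch of the variance-reduced estimate of Eq. 5.8 for a diagonal
# Gaussian: per-step grad_mu log pi(a_t | q_t) times the advantage.
# Inputs are assumed given; summing over t yields the batch contribution.
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """R_t = sum_{j >= t} gamma^(j-t) r_j, computed backwards in O(T)."""
    out, running = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def vrpg_grad_mu(mus, actions, sigma, rewards, baselines, gamma=0.99):
    adv = reward_to_go(rewards, gamma) - baselines        # R_t(tau) - b_t
    # grad_mu log N(a | mu, sigma^2) = (a - mu) / sigma^2, per time step
    return ((actions - mus) / sigma ** 2) * adv[:, None]
```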

A stochastic gradient of the prediction error (the second term in Eq. 5.7) can be obtained using backpropagation through time (BPTT) (Werbos, 1990). With estimates of both gradients, we have an estimate of the gradient of (5.7), which can be used to update the parameters through gradient descent; see Algorithm 9.

Algorithm 9: UPDATE (VRPG)
Input: θ^{n−1}, trajectories D = {τ^i}_{i=1}^M, and learning rate η.
1: Estimate a linear baseline b_t = w_b^⊤ q_t from the expected reward-to-go over the batch D:
   w_b = arg min_w (1/(TM)) ∑_{i=1}^{M} ∑_{t=1}^{T_i} ( R_t(τ^i) − w^⊤ q_t^i )².
2: Compute the VRPG loss gradient w.r.t. θ, as in Eq. 5.8:
   ∇_θ ℓ_1 = (1/M) ∑_{i=1}^{M} ∑_{t=0}^{T_i} ∇_θ log π_θ(a_t^i | q_t^i) (R_t(τ^i) − b_t).
3: Compute the prediction loss gradient:
   ∇_θ ℓ_2 = (1/M) ∑_{i=1}^{M} ∑_{t=1}^{T_i} ∇_θ ‖ W_pred(q_t^i ⊗ a_t^i) − o_t^i ‖².
4: Normalize the gradients ∇_θ ℓ_j = NORMALIZE(θ, ℓ_j), as in Eq. 5.10.
5: Compute the joint loss gradient as in Eq. 5.7: ∇_θ L = α_1 ∇_θ ℓ_1 + α_2 ∇_θ ℓ_2.
6: Update the policy parameters: θ^n = ADAM(θ^{n−1}, ∇_θ L, η).
7: Return: θ^n = (θ_PSR^n, θ_re^n).

Algorithm 10: UPDATE (Alternating Optimization)
Input: θ^{n−1}, trajectories D = {τ^i}_{i=1}^M.
1: Estimate a linear baseline b_t = w_b^⊤ q_t from the expected reward-to-go over the batch D:
   w_b = arg min_w (1/(TM)) ∑_{i=1}^{M} ∑_{t=1}^{T_i} ( R_t(τ^i) − w^⊤ q_t^i )².
2: Update θ_PSR using the joint VRPG loss gradient of Eq. 5.7:
   θ_PSR^n ← UPDATE VRPG(θ^{n−1}, D).
3: Compute the descent direction for the TRPO update of θ_re: v = H^{−1} g, where
   H = ∇²_{θ_re} ∑_{i=1}^{M} D_KL( π_{θ^{n−1}}(a_t^i | q_t^i) ‖ π_θ(a_t^i | q_t^i) ),
   g = ∇_{θ_re} (1/M) ∑_{i=1}^{M} ∑_{t=1}^{T_i} [ π_θ(a_t^i | q_t^i) / π_{θ^{n−1}}(a_t^i | q_t^i) ] (R_t(τ^i) − b_t).
4: Determine a step size η through a line search along v, maximizing the objective in (5.9) while maintaining the constraint.
5: θ_re^n ← θ_re^{n−1} + η v.
6: Return: θ^n = (θ_PSR^n, θ_re^n).

5.6.2 Alternating Optimization

TRPO: In this section, we describe a method that utilizes a natural-gradient-style update, Trust Region Policy Optimization (TRPO) (Schulman

et al., 2015). TRPO offers an alternative to vanilla natural policy gradient methods that has shown superior performance in practice (Duan et al., 2016). It uses a natural gradient update and enforces a constraint on the change in the policy at each update step, measured by the KL-divergence between conditional action distributions. This constraint, along with the natural gradient correction, ensures smoother changes of the policy parameters. Each TRPO update is the solution to the following constrained optimization problem:

$$\theta^{n+1} = \arg\max_\theta\; \mathbb{E}_{\tau \sim p(\tau \mid \pi^n)} \left[ \frac{\pi_\theta(a_t \mid q_t)}{\pi^n(a_t \mid q_t)}\, (R_t(\tau) - b_t) \right]$$
$$\text{s.t.} \quad \mathbb{E}_{\tau \sim p(\tau \mid \pi^n)}\left[ D_{KL}\big( \pi^n(\cdot \mid q_t)\, \|\, \pi_\theta(\cdot \mid q_t) \big) \right] \le \epsilon, \quad (5.9)$$

where π^n is the policy induced by θ^n, R_t and b_t are the reward-to-go and baseline functions defined in §5.6.1, and ε is a hyper-parameter that determines the allowed amount of change. In practice, we optimize using conjugate gradient with a line search, with an efficient implementation based on Fisher-information matrix-vector products, as proposed by Schulman et al. (2015). For RPSP networks, we extend TRPO by introducing an additional constraint that minimizes the Frobenius norm of the change in the prediction matrix, ‖W_pred^n − W_pred^θ‖_F, to account for updates of the PSR parameters. This constraint also ensures smooth changes of the predictive parameters.
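For the diagonal-Gaussian policies used in this chapter, the KL term inside the constraint of Eq. 5.9 has the usual closed form; a small sketch:

```python
# The closed-form KL divergence between two diagonal Gaussian action
# distributions, as used in the TRPO constraint of Eq. 5.9.
import numpy as np

def kl_diag_gauss(mu_old, sig_old, mu_new, sig_new):
    """KL( N(mu_old, diag(sig_old^2)) || N(mu_new, diag(sig_new^2)) )."""
    return float(np.sum(
        np.log(sig_new / sig_old)
        + (sig_old ** 2 + (mu_old - mu_new) ** 2) / (2.0 * sig_new ** 2)
        - 0.5
    ))
```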

Alternating Optimization: Next, we investigate a hybrid method. We would like to benefit from TRPO-type updates, but TRPO tends to be computationally intensive with recurrent architectures, mostly due to the need to perform a line search and use conjugate gradient. Instead, we resort to the following alternating optimization: in each iteration, we use TRPO to update the reactive policy parameters θ_re, which involve only a feed-forward network; then, we use a gradient step on (5.7), as described in §5.6.1, to update the PSR parameters θ_PSR; see Algorithm 10.

We can summarize the final algorithm as follows (see Algorithm 8

for details): first, we initialize the filtering parameters θ_PSR using a moment-based method: we generate initial trajectories {(o_t^i, a_t^i)_{t=1}^T}_{i=1}^M from an exploratory policy π_exp (e.g., a random Gaussian policy) and train a PSR using two-stage regression, as described in §5.4.1. After initialization, we start an iterative learning process: at each step, we update the predictive state using the current action/observation pair (filtering), as in Eq. 5.4, and sample an action from the distribution over actions output by the network, a_t ∼ π_re. At the end of an episode, we update the full set of parameters θ = {θ_PSR, θ_re} using backpropagation through time (Werbos, 1990).

Algorithm 8 makes use of a subroutine, UPDATE, that updates the policy parameters so as to minimize the loss function in Eq. 5.7, either jointly (§5.6.1) or in an alternating fashion (§5.6.2). For clarity, we provide pseudo-code for both variants of this UPDATE step: the joint VRPG update in Algorithm 9, and the alternating update in Algorithm 10.

5.6.3 Variance Normalization

It is difficult to make sense of the values of α_1 and α_2, especially if the gradient magnitudes of the respective losses are not comparable. We have verified empirically that these magnitudes are very unbalanced, so we propose a principled approach for finding the relative weights. We use α_1 = a_1 α̃_1 and α_2 = a_2 α̃_2, where a_i is the relative importance of ℓ_i in the combined objective L, and α̃_1, α̃_2 are dynamically adjusted to maintain the property that the gradient of each loss weighted by α̃_i has unit (uncentered) variance, as in Eq. 5.10. To do so, we maintain the variance of the gradient of each loss through exponential averaging and use it to adjust the weights.

$$v_{i,j}^{(n)} = (1-\beta)\, v_{i,j}^{(n-1)} + \beta\, \|\nabla_{\theta_j} \ell_i\|^2, \qquad \tilde{\alpha}_i^{(n)} = \Big( \sum_{\theta_j \in \theta} \sqrt{v_{i,j}^{(n)}} \Big)^{-1}, \quad (5.10)$$

where β ∈ R is the exponential averaging constant.
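A minimal sketch of this scheme, keeping one exponentially averaged squared-gradient norm per parameter group, could look as follows.

```python
# A minimal sketch of the normalization in Eq. 5.10: exponential averaging
# of per-parameter-group squared gradient norms, then a common rescaling.
import numpy as np

class GradientNormalizer:
    def __init__(self, beta=0.1):
        self.beta, self.v = beta, None   # v: one entry per parameter group

    def scale(self, grads):
        """grads: list of gradient arrays, one per parameter group theta_j."""
        sq = np.array([float(np.sum(g ** 2)) for g in grads])
        self.v = sq if self.v is None else (1 - self.beta) * self.v + self.beta * sq
        alpha = 1.0 / (np.sqrt(self.v).sum() + 1e-8)   # alpha-tilde in Eq. 5.10
        return [alpha * g for g in grads]
```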

5.7 Connection to RNNs with LSTMs/GRUs

RPSP networks and RNNs both define recursive models that are able to retain information about previously observed inputs. BPTT for learning predictive states in PSRs bears many similarities to BPTT for training hidden states in LSTMs or GRUs. In both cases the state is updated via a series of alternating linear and non-linear

transformations. For predictive states, the linear transformation p_t = W_ext q_t represents the system prediction: from expectations over futures q_t to expectations over extended futures p_t. The usual activation functions (tanh, ReLU) are replaced

by f_cond, which conditions on the current action and observation to update the expectation of the future statistics q_{t+1}, as in Eq. 2.76. When we consider the linear transformations W_pred and W_ext, we refer to transformations between kernel representations, between Hilbert space

embeddings. It is worth noting that these transformations represent non-linear state updates, as in RNNs, but where the form of the update is defined by the choice of representation of the state; for Hilbert space embeddings it corresponds to conditioning using the kernel Bayes' rule, as in Eq. 2.77. PSRs have the additional benefit of a clear interpretation as predictions of future observations, and they can be trained based on that interpretation. They can be initialized using moment-matching techniques with good theoretical guarantees. In contrast, RNNs may exploit heuristics for improved initialization (Zimmermann et al., 2012); however, even the best such strategy does not guarantee good performance. The initialization proposed in this work provides guarantees in the infinite-data limit (Hefny et al., 2017b).

5.8 Experiments

We evaluate the performance of RPSP networks on a collection of reinforcement learning tasks using OpenAI Gym Mujoco environments.⁷ We learn a policy for the Swimmer, Walker2d, Hopper and Cart-Pole environments with off-the-shelf environment specifications (see Figure 5.4). We consider only partially observable environments: only the angles of the agent's joints are visible to the network, without velocities.

Proposed Models: We consider an RPSP with a predictive component based on the RFFPSR, as described in §5.4.1 and §5.5. For the RFFPSR we use 1000 random Fourier features on observation and action sequences, followed by a PCA projection to d dimensions. We report the results for the best choice of d ∈ {10, 20, 30}.

We initialize the RPSP with two-stage regression on a batch of M_i initial trajectories (100 for Hopper, Walker and Cart-Pole, and 50 for Swimmer, equivalent to 10 extra iterations, or 5 for Swimmer). We then experiment with both joint VRPG optimization (RPSP-VRPG), described in §5.6.1, and alternating optimization (RPSP-Alt), described in §5.6.2. For RPSP-VRPG, we use the gradient normalization scheme described in §5.6.3.

Additionally, we consider an extended variant (denoted by +obs) that concatenates the predictive state with a window w of previous observations, as an extended form of predictive state q̃_t = [q_t, o_{t−w:t}]. If PSR learning succeeded perfectly, this additional information would be redundant; however, we observe in practice that providing observations helps the model learn faster and more stably.

Figure 5.4: OpenAI Gym Mujoco environments: Walker2d, Hopper, Swimmer, Cart-Pole (Inverted Pendulum) (top-down).

⁷ https://gym.openai.com/envs#mujoco. We ensure all models and baselines use the same OpenAI Gym base environment settings.

Competing Models: We compare our models to a finite memory model (FM) and a

recurrent neural network with GRUs (GRU). We experimented with both LSTMs and GRUs, but the latter provided better results. The finite memory models are analogous to RPSPs, but replace the predictive state with a window of past observations. We tried three variants, FM1, FM2 and FM5, with window sizes of 1, 2 and 5 respectively. Note that FM1 effectively assumes a fully observable environment. We compare to GRUs with 16-, 32-, 64- and 128-dimensional hidden states. We optimize the network parameters using the RLLab (https://github.com/openai/rllab) implementation of TRPO with two different learning rates (η = 10⁻², 10⁻³). In each model, we use a linear baseline for variance reduction, where the state of the model (i.e., the past observation window for FM, the latent state for GRU, and the predictive state for RPSP) is used as the predictor variable.

Evaluation Setup: We run each algorithm for a number of iterations that depends on the environment (see Figure 5.5). After each iteration, we compute the average return R_iter = (1/M) ∑_{m=1}^{M} ∑_{j=1}^{T_m} r_j^m on a batch of M trajectories, where T_m is the length of the m-th trajectory. We repeat this process with 10 different random seeds and report the average and standard deviation of R_iter over the 10 seeds for each iteration. For each environment, we set the number of samples in the batch to 10000 and the maximum length of each episode to 200, 500, 1000 and 1000 time steps for Cart-Pole, Swimmer, Hopper and Walker respectively. For RPSP, we found that a step size of 10⁻² performs well for both VRPG and alternating optimization in all environments. The reactive policy contains one hidden layer of 16 nodes with ReLU activation. For all models, we report the results for the choice of hyper-parameters that resulted in the highest mean cumulative reward (area under the curve).

5.9 Results

Cross-environment performance: Figure 5.5 shows the empirical average return vs. the amount of interaction with the environment (experience), measured in time steps. For RPSP networks, the plots are shifted to account for the initial trajectories used to initialize the RPSP filtering layer; the amount of shifting is equivalent to 10 trajectories. We report the best variant of each model (+obs for RPSP-VRPG on Hopper, Swimmer and Walker2d, and +obs for RPSP-Alt on Swimmer and Walker; FM5 for Hopper and Walker2d and FM2 for Cart-Pole and Swimmer; d=32 for GRU on Cart-Pole, d=128 for Hopper, and d=64 for Swimmer and Walker2d). We report the cumulative reward for all environments in the summary table of Figure 5.5(e). For all environments except Cart-Pole, a variant of RPSP wins over the baselines.


For Cart-Pole, however, convergence was slower than for the baselines, probably due to the very short trajectories used to initialize the PSR (the pole would fall immediately after a few iterations, ~5 it., resulting in a very poor initialization). For Cart-Pole, the top RPSP model performs better than the FM model (t-test, p < 0.01) and is not statistically significantly different from the GRU model. For Swimmer, our best performing model is only statistically better than the FM model (t-test, p < 0.01), while for Hopper our best RPSP model performs statistically better than both the FM and GRU models (t-test, p < 0.01), and for Walker2d RPSP outperforms only the GRU baseline (t-test, p < 0.01). We also note that RPSP-Alt provides similar performance to the joint optimization (RPSP-VRPG), but converges faster. These results suggest that RPSP networks provide an overall high-performing method for learning continuous stochastic policies, and in particular that RPSP networks are more reliably good than competing methods across environments.

models      | Swimmer    | Hopper       | Walker2d     | Cart-Pole
FM          | 41.6 ± 3.5 | 242.0 ± 5.1  | 285.1 ± 25.0 | 12.7 ± 0.6
GRU         | 22.0 ± 2.0 | 235.2 ± 9.8  | 204.5 ± 16.3 | 27.95 ± 2.3
RPSP-Alt    | 34.9 ± 1.1 | 307.4 ± 5.1  | 345.8 ± 12.6 | 22.9 ± 1.0
RPSP-VRPG   | 44.9 ± 2.8 | 305.0 ± 10.9 | 287.8 ± 21.1 | 23.8 ± 2.3

Figure 5.5: Empirical average return over a batch of M = 10 trajectories of T time steps. (a) Walker T = 1000, (b) Hopper T = 1000, (c) Cart-Pole T = 200, (d) Swimmer T = 500. Finite memory model w = 2 (FM: orange), RNN with GRUs (GRU: red), RPSP with joint optimization (RPSP-VRPG: blue), RPSP with alternating optimization (RPSP-Alt: green). For RPSP models, the graphs are shifted to the right to reflect the use of extra trajectories for initialization. (e) Summary of results: area under the curve for accumulated return ± standard error over 10 trials.

To quantify this trend, we evaluate the aggregate performance of each method over all environments. We report the average area under the curve, AUC = (1/4) ∑_env ∑_{n=1}^{100} R_n^env s_env / max_i R_i^env, where R_i^env is the return at step i in environment env and s_env is a step size that calibrates the number of iterations in each environment (2 for Cart-Pole, 3 for Swimmer, and 5 for Hopper and Walker).
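A simplified sketch of this aggregate, folding the per-environment step calibration s_env into a per-iteration average and using made-up return curves, is given below.

```python
# A simplified sketch of the normalized area-under-curve aggregate: each
# environment's return curve is rescaled by its maximum and averaged.
# The curves below are illustrative stand-ins, not experimental data.
import numpy as np

def aggregate_auc(return_curves):
    """return_curves: dict env -> 1-D array of per-iteration average returns."""
    scores = [np.mean(np.asarray(R) / np.max(R)) for R in return_curves.values()]
    return float(np.mean(scores))

curves = {"Swimmer": np.linspace(1.0, 45.0, 100),
          "Hopper": np.linspace(1.0, 300.0, 60)}
print(aggregate_auc(curves))
```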

Figure 5.6: Empirical average return over a batch of M = 10 trajectories of T time steps. (Left to right, top-down): Walker T = 1000, Hopper T = 1000, Cart-Pole T = 200, Swimmer T = 500. Comparison of RPSP-net variants: RPSP with joint optimization (VRPG: purple), RPSP with joint optimization plus observations (VRPG+obs: cyan), RPSP with alternating optimization (RPSP-Alt: dark green), RPSP with alternating optimization plus observations (RPSP-Alt+obs: light green). Graphs are shifted to the right to reflect the use of extra trajectories for initialization.

Minimal Comparisons: Next, we conducted a series of minimal comparisons intended to investigate the benefit of the proposed contributions. The RPSP model is based on the following components: (1) a state filter using a PSR, (2) a consistent initialization using two-stage regression, (3) end-to-end training of the filter and policy, and (4) an observation prediction loss used to regularize training. We define variants of RPSP that lack some of these features, in order to evaluate which of them contribute to performance.

RPSP variants (augmented PSR filter): Here, we change the state tracking layer and consider an augmented version that concatenates predictive states with a window of previous observations. We provide a complete comparison of RPSP models using augmented states for each environment in Figure 5.6, for both joint optimization (VRPG+obs) and the alternating approach (Alt+obs). Augmenting predictive states with a window of observations (w = 2) provides better results, in particular for joint optimization. This extension might mitigate prediction errors, improving the information carried over by the filtered states.

Predictive vs. latent states: In the next experiment, we replace the PSR with a GRU that is initialized using backpropagation through time.

Figure 5.7: GRU vs. RPSP filter comparison for the Walker and Cart-Pole environments. GRU filter without regularization loss (GRU: red), GRU filter with regularized predictive loss (reg_GRU: yellow), RPSP (RPSP: blue).

Figure 5.8: Effect of predictive filter regularization for the Walker2d, Cart-Pole and Swimmer environments. RPSP with predictive regularization (RPSP: blue), RPSP with fixed PSR filter parameters (fix_PSR: red), RPSP without predictive regularization loss (reactive_PSR: grey).

This is analogous to the predictive state decoders proposed by Venkatraman et al. (2017), where an observation prediction loss is included when optimizing a GRU policy network (reg_GRU).⁸ Figure 5.7 shows that the GRU model is inferior to the PSR model, whose initialization procedure is consistent and does not suffer from local optima.

⁸ The results we report here are for the partially observable setting, which differs from the reinforcement learning experiments in (Venkatraman et al., 2017).

Figure 5.9: Empirical average return over 10 trials with a batch of M = 10 trajectories of T = 1000 time steps for Hopper. (Left to right) Robustness to Gaussian observation noise σ ∈ {0.1, 0.2, 0.3}, for the best RPSP with alternating loss (Alt) and the finite memory model (FM2).

Predictive regularization: One component of RPSP is the inductive bias that predicting future observations helps find a meaningful state. In this experiment, we test this assumption. We compare three variants of RPSP: one where the PSR is randomly initialized (random_PSR), one where the PSR is fixed at its initial value (fix_PSR), and one where we train the RPSP network without the prediction loss regularization, i.e., setting α_2 = 0 in (5.7) (reactive_PSR). Figure 5.8 demonstrates that these variants are inferior to our model, showing the importance of two-stage initialization, end-to-end training, and the observation prediction loss, respectively.

Effect of observation noise: We also investigated the effect of observation noise on the RPSP model and the competitive FM baseline, by applying Gaussian noise of increasing variance to the observations. Figure 5.9 shows that while FM was very competitive with RPSP in the noiseless case, RPSP has a clear advantage over FM under mild noise. However, the performance gap vanishes when excessive noise is applied.

Finite memory models: Next, we present all finite memory models used as baselines. Figure 5.11 shows finite memory models with three different window sizes, w = 1, 2, 5, for all environments. In the main comparison we report the best model for each environment (FM2 for Walker, Swimmer and Cart-Pole, and FM1 for Hopper).

GRU baselines: In this section we report the results obtained for RNNs with GRUs, using the best learning rate η = 0.01. Figure 5.10 shows the results for different numbers of hidden units, d = 16, 32, 64, 128, on all environments.

Figure 5.10: Empirical expected return using RNNs with GRUs with d = 16 (green), d = 32 (blue), d = 64 (red) and d = 128 (yellow) hidden units. (Top-down, left-right) Swimmer, Walker, Hopper and Cart-Pole.

Figure 5.11: Empirical expected return using finite memory models with window sizes w = 1 (black), w = 2 (green) and w = 5 (red). (Top-down, left-right) Walker, Hopper, Cart-Pole and Swimmer.

5.10 Conclusions

In this chapter, we proposed RPSP networks, which combine ideas from predictive state representations with recurrent networks for reinforcement learning. This chapter poses an interesting research

direction that combines the sample-efficient method of moments with deep learning architectures. We make use of existing PSR learning algorithms to provide a statistically consistent initialization of the state tracking component, and propose gradient-based learning methods that minimize both the expected return and the prediction error over batches of trajectories. We empirically show the efficacy of the proposed approach, in terms of speed of convergence and overall expected return, against several baselines, including GRUs and finite memory models, on different OpenAI Gym environments.

6 Conclusions

In this chapter, we review the most important contributions of this thesis in Section 6.1, and provide some insights on possible future research directions in Section 6.2.

6.1 Summary of contributions

In this thesis, we explored kernel-based methods and moment-matching approaches for sequential prediction and planning problems. Sequential models are relevant to many fields in Computer Science; in particular, we focused on Natural Language Processing (NLP) and Robotics. We developed related machine learning algorithms for solving common problems in NLP and Robotics, despite the inherent core differences between the two fields: NLP deals with large discrete observation spaces, while Robotics problems have a continuous nature, typically describing the position or velocity of a robot.

We explored the method of moments (MoM) to learn sequential models for sequence labeling tasks in NLP (§ 3), and we learned to predict and plan in continuous sequential systems in Robotics (§ 5). We further explored kernel approaches to design efficient algorithms for planning in Robotics: we developed a robot motion planning algorithm for trajectory optimization in reproducing kernel Hilbert spaces (§ 4), and we made use of kernel-based representations to model and plan a robot's pose in a partially observable environment (§ 5).

First, we explored sequential models in the latent variable learning setting, where the latent states are interpretable and can provide additional information about the observations, in the form of labels. We introduced an efficient sequence labeling method for generative models in Chapter 3. We extended MoM techniques to semi-supervised learning settings, where we learn a sequence labeling task using contextual information from a set of anchor observations to disambiguate state. We further extended anchor learning to a weakly supervised setting by designing an algorithm that is able to handle small inputs

of labeled data and large amounts of unlabeled data. We provided an extension of the algorithm to log-linear models by considering a continuous feature observation space. The proposed method outper- formed other supervised and semi-supervised methods. We reported results on POS-tagging tasks with small supervision, in a scarcely an- notated language (Malagasy), and for Twitter (§ 3). This proposed approach was particularly well suited for learning with large amounts of unlabeled data, under weak supervision, while training an order of magnitude faster than any previous work. In Robotics as well, we make use of sequential models to predict continuous observations in Chapter 5. We exploit another variant of MoM approaches, spectral learning, where the latent space is never ex- plicitly recovered; instead it is used to increase the expressiveness of the model. We rely on kernel approaches to learn a continuous non- linear observation mapping and exploit linear sequential predictive models in kernel spaces using predictive state representations (PSRs). We combined predictive models with planning problems for Robotics, in Chapter 5. We make use of models of the system to improve the performance of planning algorithms. We proposed a novel class of policies for planning under uncertainty, where the latent space is re- lated to a belief over the system state. We introduced recurrent predic- tive state policies (RPSP) networks, by combining PSRs with recurrent networks. We exploited existing PSR learning algorithms to serve as a recurrent state filter of the network while leveraging recent advances in deep reinforcement learning techniques to optimize a policy. This work provided the means to have a statistically consistent initializa- tion of the state tracking component using current moment-based ap- proaches (Hefny et al., 2017b). Lastly, we provided a gradient-based learning method to minimize both expected return and prediction er- ror, providing a combined form of updating the predictive model and the learned policy end-to-end. We empirically analyzed different RPSP variants against other recurrent and predictive models. We showed the efficacy of the proposed approach in terms of speed of convergence and overall expected return for different partially observable environ- ments using an OpenAI Gym simulator. Finally, we combined kernel-based approaches with continuous plan- ning problems in Robotics in Chapter 4. In this chapter, we departed from sequential systems and MoM learning and focused on kernel based methods for robot planning problems. We introduced a kernel approach to to robot motion planning using a gradient based opti- mization algorithm. We framed trajectory planning as function op- timization in a reproducing kernel Hilbert space (RKHS), where we considered trajectories as elements on a vector valued RKHS. We fur- ther provided an efficient algorithm where the norm induced by the conclusions 125

kernel defined a good metric to the optimization problem and could be associated with different notions of smoothness, including commonly- used variants as special cases (velocity, acceleration penalization). We demonstrated the efficiency of this functional gradient trajectory opti- mization in RKHSs, compared with a common optimization algorithm (CHOMP) (Zucker et al., 2013). We showed that the proposed method benefited from a low-dimensional trajectory parametrization that was fast to compute. The inherent smoothness of the RKHS allowed us to achieve faster convergence, by taking larger steps without breaking smoothness. We extended planning to vector valued spaces, impor- tant to model robot joint interactions, and further provided forms of incorporating kinematic and dynamic constraints along the trajectory. This work presents an important step in exploring RKHSs for motion planning. We described trajectory optimization under the light of re- producing kernels, which we hope can leverage the knowledge used in kernel based machine learning to develop better motion planning methods.
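The following minimal sketch (our own illustration, in assumed notation consistent with the recap above, not the thesis code) shows a one-DoF trajectory represented as a finite linear combination of Gaussian RBF kernel functions, with its squared RKHS norm acting as the smoothness penalty.

```python
import numpy as np

def gaussian_rbf(t, s, beta=1.0):
    """Gaussian RBF kernel k(t, s) on the time domain (broadcasts over arrays)."""
    return np.exp(-beta * (t - s) ** 2)

support = np.array([0.1, 0.4, 0.7])  # time centers of the kernel expansion
alpha = np.array([0.5, -0.2, 0.8])   # expansion coefficients

def xi(t):
    """Trajectory xi(t) = sum_i alpha_i k(t_i, t)."""
    return gaussian_rbf(support, t) @ alpha

K = gaussian_rbf(support[:, None], support[None, :])
smoothness_penalty = alpha @ K @ alpha  # squared RKHS norm of xi
```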

6.2 Future directions

In this section, we provide a few possible future research directions and discuss some open research questions pertaining to this thesis.

6.2.1 Extend anchor learning to continuous observation spaces

In Chapter 3 we proposed a moment-based learning method for latent variable models that relies on anchor observations. These observations provide an unambiguous map between certain observation types, i.e., elements of a finite dictionary, and state variables (or labels, in this case). Although we presented algorithms that are able to handle continuous observation representations, we only considered anchor mappings for certain discrete observation types. A more general form of anchor learning could extend this notion to certain events in time that are associated with a given hidden state variable. This insight would allow for continuous representations and consider anchors as observation "tokens" instead of types. We could think of an extension of the proposed algorithm to capture information about anchor events in time and their context. Hidden states would still correspond to a finite set, but the observations may be continuous and the association is done in time (anchors in time). For each hidden state in Eq. 3.10 we can model the expectation in time over all contexts of every instance that is unambiguously associated with that particular state. "Would correlations of these anchor tokens with their contextual information be sufficient to learn the latent variables at any particular time?" is an open question that naturally arises from this interpretation. The anchor learning algorithm would be a suitable algorithm for dealing with anchor tokens, since it provides an efficient form of looking at these correlations.
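To make the idea concrete, here is a minimal sketch (our own illustration, not an algorithm from Chapter 3) of estimating a context moment for anchor tokens: for every occurrence of an anchor associated with a given state, we average a continuous context-feature vector over a symmetric window. The `features` embedding and the `is_anchor_for_state` predicate are hypothetical interfaces.

```python
import numpy as np

def anchor_context_moment(sequences, features, is_anchor_for_state, ctx=1):
    """Mean context feature over all occurrences of one state's anchor tokens.

    sequences: list of token lists; features: token -> R^d vector;
    is_anchor_for_state: token -> bool; ctx: context window half-width.
    """
    context_vectors = []
    for seq in sequences:
        for t, token in enumerate(seq):
            if is_anchor_for_state(token):
                window = seq[max(0, t - ctx):t] + seq[t + 1:t + 1 + ctx]
                if window:
                    context_vectors.append(
                        np.mean([features(w) for w in window], axis=0))
    return np.mean(context_vectors, axis=0) if context_vectors else None
```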

6.2.2 The effect of impure anchors

Another possible future direction of research relates to the theoretical analysis of impure anchors. In Chapter 3, we used anchor words as placeholders for latent variables. This substitution is exact in the case of pure anchors, meaning anchor words that only occur with a particular state. In this case alone, we may exchange state contextual information with anchor word contextual information. Nevertheless, from a practical point of view, finding such observations may be difficult, and in some cases they may occur only very sparsely, leading to poor estimates of the contexts. Therefore, it would be useful to relax this notion, as we empirically observe, and to understand from a theoretical perspective the trade-offs between purity and how well we may estimate the hidden states or, in turn, their contexts. Work on hyperspectral unmixing in the remote sensing community provides some analysis in this direction (Bioucas-Dias et al., 2012), in particular when anchors, or end-members as they are called in that community, lie on a facet of the simplex (occur with just a few hidden variables) or when they occur with a given hidden variable with high probability. It remains an open question how well each of these variants would perform as substitutes for hidden variables in the context of the approach proposed in Chapter 3, and also in conjunction with semi-supervised information.
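The bias introduced by impurity can be seen in a toy computation (our own illustration, not an analysis from Chapter 3): the expected context of an anchor is a convex combination of the per-state context moments, weighted by P(state | anchor), so the substitution error shrinks as the anchor's purity grows.

```python
import numpy as np

# E[context features | state], one row per hidden state (toy numbers).
state_context_moments = np.array([[0.9, 0.1],
                                  [0.2, 0.8]])

# A nearly pure anchor for state 0: P(state | anchor) = (0.95, 0.05).
p_state_given_anchor = np.array([0.95, 0.05])

# Expected context of the anchor: a convex combination of the state moments.
anchor_context = p_state_given_anchor @ state_context_moments
bias = anchor_context - state_context_moments[0]  # error of the substitution
print(anchor_context, bias)  # the bias grows as purity (0.95 here) decreases
```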

6.2.3 Combining randomized planners with kernel representations

RKHSs provide an expressive and natural metric for representing and optimizing trajectories, as proposed in Chapter 4. Another line of work in robot motion planning finds trajectories using sampling-based approaches. Methods such as rapidly exploring random trees (RRTs) and probabilistic roadmaps (PRMs) explore the configuration space of the robot by building a graph from sampled configurations, and then searching for a viable path within this graph. This family of methods provides an efficient form of obtaining viable trajectories, but lacks the optimality provided by trajectory optimization algorithms, such as the one proposed in Chapter 4. An interesting line of work could try to bring together these two families of motion planning, by combining the expressiveness of RKHSs with the efficiency of randomized approaches. Francis et al. (2017) provide an initial step in this direction, but this remains an open research direction.
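One speculative way to combine the two families is sketched below, under our own assumptions (the "RRT" path is a random stand-in, and the Gaussian RBF kernels mirror the parametrization used in Chapter 4): fit an RKHS trajectory to the waypoints returned by a sampling-based planner via kernel ridge regression, then use it to warm-start the functional gradient optimization.

```python
import numpy as np

def rbf(t, s, beta=10.0):
    """Gaussian RBF kernel on the time domain (broadcasts over arrays)."""
    return np.exp(-beta * (t - s) ** 2)

def fit_rkhs_to_path(times, waypoints, beta=10.0, reg=1e-6):
    """Kernel ridge regression: waypoints[i] ~ sum_j alpha_j k(t_j, t_i)."""
    K = rbf(times[:, None], times[None, :], beta)
    alpha = np.linalg.solve(K + reg * np.eye(len(times)), waypoints)
    return lambda t: rbf(np.atleast_1d(t)[:, None], times[None, :], beta) @ alpha

times = np.linspace(0.0, 1.0, 8)
rrt_waypoints = np.random.rand(8, 2)         # stand-in for an RRT/PRM solution
xi = fit_rkhs_to_path(times, rrt_waypoints)  # smooth xi: [0, 1] -> R^2
# xi would then seed the functional gradient updates of Chapter 4.
```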

6.2.4 Exploring natural gradient approaches for RPSPs

Current natural gradient descent algorithms, such as the one used in the alternating optimization approach in Chapter 5, require the computation of the Fisher information matrix as a quadratic approximation of the KL divergence constraint in Eq. 5.9. Schulman et al. (2015) provide an efficient form of computing this constraint in terms of approximate Hessian-vector products, but this step, combined with the line search required to compute the conjugate gradient step, is still relatively slow compared with the REINFORCE approach. As a possible future research direction, it would be of interest to retain the steady, monotonically improving policy updates of natural gradient methods while reducing their computation time.
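For concreteness, here is a minimal sketch of the conjugate gradient solver at the heart of this step; it follows the generic recipe behind Schulman et al. (2015) rather than their implementation, and accesses the Fisher matrix F only through matrix-vector products fvp(v) = Fv.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, given only Fisher-vector products."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x (x starts at zero)
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        step = rs / (p @ Fp)
        x += step * p
        r -= step * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy usage with an SPD matrix standing in for the Fisher matrix:
F = np.array([[3.0, 1.0], [1.0, 2.0]])
natural_step = conjugate_gradient(lambda v: F @ v, np.array([1.0, 1.0]))
```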

6.2.5 Exploit predictive models for improved real robot experiments

Model-based learning offers a sample-efficient way of learning a policy using simulated experience gathered from the model. The most common difficulty in model-based approaches lies in the model estimation itself: learning an accurate model is sometimes harder than directly optimizing the policy parameters. PSRs, however, grant a computationally effective form of estimating a model of the environment based solely on observations. The proposed work on policy learning with PSRs (RPSPs), in Chapter 5, learns a policy and a predictive model simultaneously from scratch. It would be interesting to extend PSRs to learn in a real robot experiment; for instance, to study how we could use the learned model to improve real-robot sample efficiency, either as an efficient experience generator for learning a real-robot policy, or as an initial policy. Analogous to the setting of transfer learning, it would be interesting to learn how we can train RPSPs on simulated data such that the model performs well with just a few iterations of refinement on the real robot (Finn et al., 2017; Rusu et al., 2016). Alternatively, we may learn a mixture of experts (Shazeer et al., 2017; Eigen et al., 2013) over both real robot experiments and simulated environments and optimize their combination. Other forms of improvement can be achieved by closing the loop between simulations and real-robot trials (Christiano et al., 2016), by learning a policy over the residual error between the predictive simulations and real-robot samples. This policy class could be an RPSP whose prediction error is between the predicted simulation and the real robot, or a prediction of the residual. This alone, however, would not improve the sample complexity of the real experience, which suggests that a combination of the two approaches, transfer learning with residual prediction, could provide an efficient and robust approach.
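A hedged sketch of the "experience generator" idea follows; `model.filter`, `model.predict`, and `policy` are hypothetical interfaces, not the RPSP code of Chapter 5. Imagined rollouts from the learned predictive model would be mixed with real transitions before each policy update.

```python
def synthetic_rollouts(model, policy, start_obs, n_rollouts=10, horizon=50):
    """Generate imagined trajectories from a learned predictive model."""
    trajectories = []
    for _ in range(n_rollouts):
        state = model.filter(start_obs)  # initial predictive (belief) state
        trajectory = []
        for _ in range(horizon):
            action = policy(state)
            obs, state = model.predict(state, action)  # imagined transition
            trajectory.append((obs, action))
        trajectories.append(trajectory)
    return trajectories  # appended to the real batch before a policy update
```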

6.2.6 Connections with deep learning

Overall, the method of moments provides a statistically consistent alternative for learning sequential systems; however, when compared with current state-of-the-art models, in particular deep learning approaches, it has not been as successful as desired. Spectral methods possess some disadvantages when compared with deep neural networks: they assume a fixed filtering function, which may fail when modeling more complex, non-stationary systems. Spectral methods are also very sensitive, meaning performance degrades quickly if we fall out of the learned region, making them more vulnerable to model mismatch. It is also worth noting that although both types of methods are demanding in terms of samples, most spectral learning variants have been proposed for the unsupervised setting, while deep networks have mostly been used in a supervised way. In Robotics, Sun et al. (2016a) provide an approach with promising results using spectral learning for time-variant systems, but comparisons with deep learning approaches have not been made in this case. In NLP, on the other hand, deep learning methods achieve state-of-the-art performance on different tasks. Although many of these improvements may come from highly optimized parameter tuning (Dhillon et al., 2015a; Stratos et al., 2015) compared with spectral learning approaches, deep networks still provide more flexibility, by allowing modular characteristics such as attention mechanisms over the entire sequence (Bahdanau et al., 2015), gating mechanisms and memory (Hochreiter et al., 1995; Cho et al., 2014a), and more flexible filtering functions. Exploring this additional flexibility in the method of moments provides very promising future research directions. In kernel approaches as well, there has been further work trying to combine kernel methods and deep architectures, on one side improving the theoretical understanding of deep methods (Belkin et al., 2018), and on the other side leveraging the expressive power and scalability of deep architectures in kernel learning (Zhuang et al., 2011; Cho et al., 2009; Wilson et al., 2016; Song et al., 2018).

Bibliography

[Abb+08] Pieter Abbeel, Adam Coates, Timothy Hunter, and Andrew Y. Ng. "Autonomous Autorotation of an RC Helicopter". In: Experimental Robotics, The Eleventh International Symposium, ISER 2008, July 13-16, 2008, Athens, Greece. 2008, pp. 385–394.
[Aga+97] Pankaj Agarwal, Jean-Claude Latombe, Rajeev Motwani, and Prabhakar Raghavan. "Nonholonomic Path Planning for Pushing a Disk Among Obstacles". In: ICRA. 1997.
[Aga+99] Pankaj K. Agarwal, Boris Aronov, and Micha Sharir. "Motion planning for a convex polygon in a polygonal environment". In: Discrete & Computational Geometry 22.2 (1999), pp. 201–221.
[Ale+68] V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva. "Stochastic Optimization". In: Engineering Cybernetics 5 (1968), pp. 11–16.
[Alt+06] Yasemin Altun and Alexander J. Smola. "Unifying Divergence Minimization and Statistical Inference Via Convex Duality". In: Learning Theory, 19th Annual Conference on Learning Theory, COLT. 2006.
[Ama98] S. Amari. "Natural Gradient Works Efficiently in Learning". In: Neural Comput. 10.2 (1998), pp. 251–276.
[Ana+12a] A. Anandkumar, D. Hsu, and S. M. Kakade. "A Method of Moments for Mixture Models and Hidden Markov Models". In: COLT (2012).
[Ana+12b] Anima Anandkumar, Rong Ge, Daniel Hsu, Sham M. Kakade, and Matus Telgarsky. "Tensor decompositions for learning latent variable models". In: arXiv preprint arXiv:1210.7559 (2012).
[Ana+14] Animashree Anandkumar, Rong Ge, and Majid Janzamin. "Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates". In: CoRR (2014).
[Ana+15] Animashree Anandkumar, Daniel Hsu, Majid Janzamin, and Sham Kakade. "When Are Overcomplete Topic Models Identifiable? Uniqueness of Tensor Tucker Decompositions with Structured Sparsity". In: J. Mach. Learn. Res. 16.1 (Jan. 2015), pp. 2643–2694. ISSN: 1532-4435.
[Aro+98] Boris Aronov, Mark de Berg, A. Frank van der Stappen, Petr Švestka, and Jules Vleugels. "Motion Planning for Multiple Robots". In: Proceedings of the Fourteenth Annual Symposium on Computational Geometry. SCG '98. Minneapolis, Minnesota, USA: ACM, 1998, pp. 374–382. ISBN: 0-89791-973-4.
[Aro50] N. Aronszajn. "Theory of Reproducing Kernels". In: Transactions of the American Mathematical Society. 1950.

[Aro+13] Sanjeev Arora, Rong Ge, Yoni Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. "A Practical Algorithm for Topic Modeling with Provable Guarantees". In: Proc. of International Conference of Machine Learning. 2013.
[Aro+12] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. "Computing a nonnegative matrix factorization - provably". In: Proceedings of the 44th Symposium on Theory of Computing Conference, STOC. 2012.
[Aro+01] Sanjeev Arora and Ravi Kannan. "Learning mixtures of arbitrary gaussians". In: STOC. ACM, 2001, pp. 247–257.
[Ary+94] Sunil Arya, David M. Mount, Nathan Netanyahu, Ruth Silverman, and Angela Y. Wu. "An optimal algorithm for approximate nearest neighbor searching in fixed dimensions". In: 1994.
[Atk98] C. G. Atkeson. "Nonparametric Model-Based Reinforcement Learning". In: Advances in Neural Information Processing Systems 10. MIT Press, 1998, pp. 1008–1014.
[Atk+97] C. G. Atkeson and J. C. Santamaria. "A comparison of direct and model-based reinforcement learning". In: Proceedings of the International Conference on Robotics and Automation. Vol. 4. 1997, pp. 3557–3564.
[Azi+16] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. "Reinforcement Learning of POMDPs using Spectral Methods". In: CoRR abs/1602.07764 (2016).
[Bae+93] Ricardo A. Baeza-Yates, Joseph C. Culberson, and Gregory J. E. Rawlins. "Searching in the plane". In: Information and Computation 106.2 (1993), pp. 234–252.
[Bag+01] Drew Bagnell and Jeff Schneider. "Autonomous helicopter control using reinforcement learning policy search methods". In: International Conference on Robotics and Automation. 2001.
[Bah+15] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. "End-to-End Attention-based Large Vocabulary Speech Recognition". In: CoRR abs/1508.04395 (2015).
[Bai+13a] R. Bailly, X. Carreras, F. M. Luque, and A. Quattoni. "Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion". In: EMNLP. 2013.
[Bai+13b] Raphaël Bailly, Xavier Carreras, Franco M. Luque, and Ariadna Quattoni. "Unsupervised Spectral Learning of WCFG as Low-rank Matrix Completion". In: Proc. of Empirical Methods in Natural Language Processing. 2013, pp. 624–635.
[Bal+12a] B. Balle and M. Mohri. "Spectral Learning of General Weighted Automata via Constrained Matrix Completion". In: Neural Information Processing Systems Foundation. 2012.
[Bal13] Borja Balle. "Learning finite-state machines: algorithmic and statistical aspects". PhD thesis. Universitat Politècnica de Catalunya, 2013.
[Bal+11] Borja Balle, Ariadna Quattoni, and Xavier Carreras. "A Spectral Learning Algorithm for Finite State Transducers". In: (2011).
[Bal+12b] Borja Balle, Ariadna Quattoni, and Xavier Carreras. "Local loss optimization in operator models: A new insight into spectral learning". In: ICML (2012).
[Bal+12c] Borja Balle and Mehryar Mohri. "Spectral learning of general weighted automata via constrained matrix completion". In: Advances in Neural Information Processing Systems. 2012, pp. 2168–2176.

[Bau+70] Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains". In: The Annals of Mathematical Statistics 41.1 (1970), pp. 164–171. ISSN: 0003-4851.
[Bax+01] Jonathan Baxter and Peter L. Bartlett. "Infinite-horizon policy-gradient estimation". In: Journal of Artificial Intelligence Research (2001).
[Bei+00] Amos Beimel, Francesco Bergadano, Nader H. Bshouty, Eyal Kushilevitz, and Stefano Varricchio. "Learning Functions Represented As Multiplicity Automata". In: J. ACM 47.3 (May 2000), pp. 506–530. ISSN: 0004-5411.
[Bel+18] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. "To understand deep learning we need to understand kernel learning". In: CoRR abs/1802.01396 (2018).
[Bel+10] Mikhail Belkin and Kaushik Sinha. "Toward Learning Gaussian Mixtures with Arbitrary Separation". In: COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010. 2010, pp. 407–419.
[Ben09] Yoshua Bengio. "Learning Deep Architectures for AI". In: Foundations and Trends in Machine Learning 2.1 (2009), pp. 1–127. ISSN: 1935-8237.
[Ber+11] J. van den Berg, P. Abbeel, and K. Goldberg. "LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information". In: International Journal of Robotics Research (2011).
[BK+10] Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. "Painless unsupervised learning with features". In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. ACL, 2010, pp. 582–590.
[Ber+10] Taylor Berg-Kirkpatrick, Alexandre Bouchard-Côté, John DeNero, and Dan Klein. "Painless Unsupervised Learning with Features". In: Human Language Technologies: Conference of the North American Association of Computational Linguistics. 2010.
[Ber+79] Jean Berstel and Luc Boasson. "Transductions and context-free languages". In: Ed. Teubner (1979), pp. 1–278.
[BD+12] José M. Bioucas-Dias, Antonio J. Plaza, Nicolas Dobigeon, Mario Parente, Qian Du, Paul D. Gader, and Jocelyn Chanussot. "Hyperspectral Unmixing Overview: Geometrical, Statistical, and Sparse Regression-Based Approaches". In: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 5 (2012), pp. 354–379.
[Bla+98] A. Blake and M. Isard. Active Contours. Springer-Verlag New York, Inc., 1998.
[Bla+08] M. B. Blaschko and C. H. Lampert. "Correlational spectral clustering". In: 2008 IEEE Conference on Computer Vision and Pattern Recognition. 2008, pp. 1–8.
[Ble+03] D. M. Blei, A. Y. Ng, and M. I. Jordan. "Latent dirichlet allocation". In: Journal of Machine Learning Research 3 (2003), pp. 993–1022.
[Boj+16] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. "End to End Learning for Self-Driving Cars". In: CoRR abs/1604.07316 (2016).

[Boo+11a] Byron Boots, Sajid M. Siddiqi, and Geoffrey J. Gordon. "Closing the learning-planning loop with predictive state representations". In: IJRR 30.7 (2011).
[Boo+11b] Byron Boots, Sajid Siddiqi, and Geoffrey Gordon. "Closing the Learning-Planning Loop with Predictive State Representations". In: I. J. Robotic Research. Vol. 30. 2011, pp. 954–956.
[Boo+13a] Byron Boots, Arthur Gretton, and Geoffrey J. Gordon. "Hilbert Space Embeddings of Predictive State Representations". In: UAI. 2013.
[Boo+13b] Byron Boots, Arthur Gretton, and Geoffrey J. Gordon. "Hilbert Space Embeddings of Predictive State Representations". In: Proc. 29th Intl. Conf. on Uncertainty in Artificial Intelligence (UAI). 2013.
[Boo+10] Byron Boots and Geoffrey J. Gordon. "Predictive State Temporal Difference Learning". In: NIPS. 2010.
[Bra+11] S. R. K. Branavan, David Silver, and Regina Barzilay. "Learning to Win by Reading Manuals in a Monte-Carlo Framework". In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. HLT '11. Portland, Oregon: Association for Computational Linguistics, 2011, pp. 268–277. ISBN: 978-1-932432-87-9.
[Bro02] O. Brock and O. Khatib. "Elastic strips: A framework for motion generation in human environments". In: International Journal of Robotics Research. 2002.
[Bro+92] Peter F. Brown, Peter V. de Souza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. "Class-based N-gram Models of Natural Language". In: Computational Linguistics 18.4 (1992), pp. 467–479.
[Bru+08] S. Charles Brubaker and Santosh Vempala. "Isotropic PCA and Affine-Invariant Clustering". In: 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25-28, 2008, Philadelphia, PA, USA. 2008, pp. 551–560.
[Car+71] J. W. Carlyle and A. Paz. "Realizations by Stochastic Finite Automata". In: J. Comput. Syst. Sci. 5.1 (1971).
[Cha+14a] Arun Chaganty and Percy Liang. "Estimating Latent-Variable Graphical Models using Moments and Likelihoods". In: ICML. 2014.
[Cha+14b] Arun T. Chaganty and Percy Liang. "Estimating Latent-Variable Graphical Models using Moments and Likelihoods". In: Proc. of International Conference on Machine Learning. 2014.
[Cha96] J. T. Chang. "Full reconstruction of Markov models on evolutionary trees: identifiability and consistency". In: Mathematical Biosciences 137.1 (1996), pp. 51–73. ISSN: 0025-5564.
[Cha+15] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. "Learning to Search Better Than Your Teacher". In: CoRR abs/1502.02206 (2015).
[Cha+14c] Kai-Wei Chang, Scott Wen-tau Yih, Bishan Yang, and Chris Meek. "Typed Tensor Decomposition of Knowledge Bases for Relation Extraction". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. ACL - Association for Computational Linguistics, 2014.
[Cha+08] Kamalika Chaudhuri and Satish Rao. "Learning Mixtures of Product Distributions Using Correlations and Independence". In: 21st Annual Conference on Learning Theory - COLT 2008, Helsinki, Finland, July 9-12, 2008. 2008, pp. 9–20.

[Chi+07] Tsung-Han Chiang, Mehmet Serkan Apaydin, Douglas L. Brutlag, David Hsu, and Jean-Claude Latombe. "Using Stochastic Roadmap Simulation to Predict Experimental Quantities in Protein Folding Kinetics: Folding Rates and Phi-Values". In: Journal of Computational Biology 14.5 (2007), pp. 578–593.
[Cho+14a] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, pp. 1724–1734.
[Cho+14b] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". In: Empirical Methods in Natural Language Processing, EMNLP. 2014, pp. 1724–1734.
[Cho+09] Youngmin Cho and Lawrence K. Saul. "Kernel Methods for Deep Learning". In: Advances in Neural Information Processing Systems 22. Ed. by Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta. Curran Associates, Inc., 2009, pp. 342–350.
[Cho+17] Eunsol Choi, Daniel Hewlett, Jakob Uszkoreit, Illia Polosukhin, Alexandre Lacoste, and Jonathan Berant. Coarse-to-Fine Question Answering for Long Documents. Jan. 2017.
[Chr+16] Paul F. Christiano, Zain Shah, Igor Mordatch, Jonas Schneider, Trevor Blackwell, Joshua Tobin, Pieter Abbeel, and Wojciech Zaremba. "Transfer from Simulation to Real World through Learning Deep Inverse Dynamics Model". In: CoRR abs/1610.03518 (2016).
[Coh+13a] S. B. Cohen, G. Satta, and M. Collins. "Approximate PCFG Parsing Using Tensor Decomposition". In: Proceedings of NAACL. 2013.
[Coh+12] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. "Spectral Learning of Latent-Variable PCFGs". In: ACL. 2012.
[Coh+14] Shay B. Cohen and Michael Collins. "A Provably Correct Learning Algorithm for Latent-Variable PCFGs". In: Proc. of Association for Computational Linguistics. 2014.
[Coh+13b] Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle H. Ungar. "Experiments with Spectral Learning of Latent-Variable PCFGs". In: ACL. 2013, pp. 148–157.
[Coh+13c] Shay B. Cohen, Karl Stratos, Michael Collins, Dean P. Foster, and Lyle Ungar. "Experiments with Spectral Learning of Latent-Variable PCFGs". In: Proc. of North American Association of Computational Linguistics. 2013.
[Com+08] Pierre Comon, Gene Golub, Lek-Heng Lim, and Bernard Mourrain. "Symmetric Tensors and Symmetric Tensor Rank". In: SIAM Journal on Matrix Analysis and Applications 30.3 (2008), pp. 1254–1279.
[Con85] John B. Conway. A Course in Functional Analysis. Springer, 1985.
[Cut+90] Mark R. Cutkosky and Robert D. Howe. "Human Grasp Choice and Robotic Grasp Analysis". In: Dextrous Robot Hands. Ed. by S. T. Venkataraman and T. Iberall. New York, NY, USA: Springer-Verlag New York, Inc., 1990, pp. 5–31. ISBN: 0-387-97190-4.
[Das99] Sanjoy Dasgupta. Learning Mixtures of Gaussians. Tech. rep. UCB/CSD-99-1047. EECS Department, University of California, Berkeley, 1999.

[Das+07] Sanjoy Dasgupta and Leonard J. Schulman. "A Probabilistic Analysis of EM for Mixtures of Separated, Spherical Gaussians". In: Journal of Machine Learning Research 8 (2007), pp. 203–226.
[Das+00] Sanjoy Dasgupta and Leonard J. Schulman. "A Two-Round Variant of EM for Gaussian Mixtures". In: UAI. Morgan Kaufmann, 2000, pp. 152–159.
[Day+97] Peter Dayan and Geoffrey E. Hinton. "Using Expectation-Maximization for Reinforcement Learning". In: Neural Computation 9.2 (1997), pp. 271–278.
[Dee+90a] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. "Indexing by latent semantic analysis". In: Journal of the American Society for Information Science 41.6 (1990), pp. 391–407.
[Dee+90b] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. "Indexing by latent semantic analysis". In: Journal of the American Society for Information Science 41.6 (1990), pp. 391–407.
[Dei+13] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. "A survey on policy search for robotics". In: (2013).
[Dei+11] Marc Peter Deisenroth and Carl Edward Rasmussen. "PILCO: A Model-Based and Data-Efficient Approach to Policy Search". In: Proceedings of the International Conference on Machine Learning. 2011.
[Dem+77] A. P. Dempster, N. M. Laird, and D. B. Rubin. "Maximum likelihood from incomplete data via the EM algorithm". In: JRSS (1977), pp. 1–38.
[Dhi+11] P. S. Dhillon, D. Foster, and L. Ungar. "Multiview learning of word embeddings via CCA". In: NIPS. 2011.
[Dhi+15a] Paramveer Dhillon, Dean Foster, and Lyle Ungar. "Eigenwords: Spectral Word Embeddings". In: JMLR (2015).
[Dhi+15b] Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar. "Eigenwords: Spectral Word Embeddings". In: Journal of Machine Learning Research 16 (2015), pp. 3035–3078.
[Don+03] David Donoho and Victoria Stodden. "When Does Non-Negative Matrix Factorization Give Correct Decomposition into Parts?" In: MIT Press, 2003, p. 2004.
[Dow+17] Carlton Downey, Ahmed Hefny, and Geoffrey J. Gordon. "Practical Learning of Predictive State Representations". In: arXiv:1702.04121. 2017.
[Dra+14] A. Dragan and S. Srinivasa. "Familiarization to Robot Motion". In: International Conference on Human-Robot Interaction (HRI) (2014).
[Dua+16] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. "Benchmarking Deep Reinforcement Learning for Continuous Control". In: Proceedings of the 33rd International Conference on Machine Learning - Volume 48. ICML'16. 2016, pp. 1329–1338.
[Dup+05] P. Dupont, F. Denis, and Y. Esposito. "Links Between Probabilistic Automata and Hidden Markov Models: Probability Distributions, Learning Models and Induction Algorithms". In: Pattern Recogn. 38.9 (Sept. 2005), pp. 1349–1371. ISSN: 0031-3203.
[Eig+13] David Eigen, Marc'Aurelio Ranzato, and Ilya Sutskever. "Learning Factored Representations in a Deep Mixture of Experts". In: CoRR abs/1312.4314 (2013).

[Fin+17] Chelsea Finn, Pieter Abbeel, and Sergey Levine. "Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks". In: CoRR abs/1703.03400 (2017).
[Fis12] R. A. Fisher. "On an absolute criterion for fitting frequency curves". In: Messenger of Mathematics 41 (1912), pp. 155–160.
[Fli74] Michael Fliess. "Matrices de Hankel". In: Journal Math. Pures Appl. 53.9 (1974), pp. 197–222.
[Fra+17] Gilad Francis, Lionel Ott, and Fabio Ramos. "Stochastic Functional Gradient Path Planning in Occupancy Maps". In: arXiv:1705.05987 (May 2017).
[Fuk+13a] Kenji Fukumizu, Le Song, and Arthur Gretton. "Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels". In: Journal of Machine Learning Research 14 (2013), pp. 3753–3783.
[Fuk+13b] Kenji Fukumizu, Le Song, and Arthur Gretton. "Kernel Bayes' rule: Bayesian inference with positive definite kernels". In: Journal of Machine Learning Research 14.1 (2013), pp. 3753–3783.
[Fuk+13c] Kenji Fukumizu, Le Song, and Arthur Gretton. "Kernel Bayes' rule: Bayesian inference with positive definite kernels". In: Journal of Machine Learning Research 14.1 (2013).
[Gar+13] Dan Garrette, Jason Mielens, and Jason Baldridge. "Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages". In: Proc. of Association for Computational Linguistics. 2013.
[Gen+06] Tao Geng, Bernd Porr, and Florentin Wörgötter. "Fast biped walking with a reflexive controller and real-time policy searching". In: Advances in Neural Information Processing Systems. 2006, pp. 427–434.
[Gha+99] Zoubin Ghahramani and Matthew J. Beal. "Variational Inference for Bayesian Mixtures of Factor Analysers". In: Advances in Neural Information Processing Systems 12 [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999]. 1999, pp. 449–455.
[Gim+11] Gimpel, Schneider, O'Connor, Das, Mills, Eisenstein, Heilman, Yogatama, Flanigan, and Smith. "Part-of-speech Tagging for Twitter: Annotation, Features, and Experiments". In: ACL. 2011.
[Gly90] Peter W. Glynn. "Likelihood ratio gradient estimation for stochastic systems". In: Communications of the ACM 33.10 (1990), pp. 75–84.
[God60] V. P. Godambe. "An Optimum Property of Regular Maximum Likelihood Estimation". In: Ann. Math. Statist. 31.4 (Dec. 1960), pp. 1208–1211.
[Gol+12] Yoav Goldberg and Joakim Nivre. "A Dynamic Oracle for Arc-Eager Dependency Parsing". In: COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India. 2012, pp. 959–976.
[Goo+16] James Goodman, Andreas Vlachos, and Jason Naradowsky. "Noise reduction and targeted exploration in imitation learning for Abstract Meaning Representation parsing". In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. 2016.
[Gor99] Geoffrey J. Gordon. "Regret Bounds for Prediction Problems". In: Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT 1999, Santa Cruz, CA, USA, July 7-9, 1999. 1999, pp. 29–40.

[Gre+01] Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. "Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning". In: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic. NIPS'01. 2001.
[Gri+14] Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. "Don't Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation". In: Empirical Methods in Natural Language Processing. Doha, Qatar, 2014.
[Gul+94] Vijaykumar Gullapalli, Andrew G. Barto, and Roderic A. Grupen. "Learning Admittance Mappings for Force-Guided Assembly". In: Proceedings of the 1994 International Conference on Robotics and Automation, San Diego, CA, USA, May 1994. 1994, pp. 2633–2638.
[Haa+16] Tuomas Haarnoja, Anurag Ajay, Sergey Levine, and Pieter Abbeel. "Backprop KF: Learning Discriminative Deterministic State Estimators". In: Advances in Neural Information Processing Systems 29. Curran Associates, Inc., 2016.
[Hag+06] Aria Haghighi and Dan Klein. "Prototype-Driven Learning for Sequence Models". In: Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 4-9, 2006, New York, New York, USA. 2006.
[Hal+11] N. Halko, P. G. Martinsson, and J. A. Tropp. "Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions". In: SIAM Rev. (2011).
[Ham+14] William Hamilton, Mahdi Milani Fard, and Joelle Pineau. "Efficient Learning and Planning with Compressed Predictive States". In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 3395–3439.
[He+16] Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. "Dual Learning for Machine Translation". In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett. Curran Associates, Inc., 2016, pp. 820–828.
[Hef+17a] Ahmed Hefny, Carlton Downey, Zita Marinho, Wen Sun, Siddhartha Srinivasa, and Geoffrey J. Gordon. "Predictive State Models for Prediction and Control in Partially Observable Environments". In: Conference on Robot Learning (CoRL). 2017.
[Hef+17b] Ahmed Hefny, Carlton Downey, and Geoffrey J. Gordon. "Supervised Learning for Controlled Dynamical System Learning". In: arXiv:1702.03537. 2017.
[Hef+15a] Ahmed Hefny, Carlton Downey, and Geoffrey J. Gordon. "Supervised Learning for Dynamical System Learning". In: NIPS. 2015.
[Hef+15b] Ahmed Hefny, Carlton Downey, and Geoffrey J. Gordon. "Supervised Learning for Dynamical System Learning". In: Advances in Neural Information Processing Systems. 2015.
[Hel65] Alex Heller. "On Stochastic Processes Derived From Markov Chains". In: Ann. Math. Statist. 36.4 (Aug. 1965), pp. 1286–1291.
[Hin+06] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. "A Fast Learning Algorithm for Deep Belief Nets". In: Neural Comput. 18.7 (July 2006), pp. 1527–1554. ISSN: 0899-7667.
[Hit27] Frank L. Hitchcock. "The Expression of a Tensor or a Polyadic as a Sum of Products". In: Journal of Mathematics and Physics 6.1-4 (1927), pp. 164–189. ISSN: 1467-9590.
[Hoc+95] S. Hochreiter and J. Schmidhuber. "Long Short-Term Memory". In: (1995).

[Hoc+97] Sepp Hochreiter and Jürgen Schmidhuber. "Long Short-Term Memory". In: Neural Comput. 9.8 (1997), pp. 1735–1780.
[Hof+08] T. Hofmann, B. Schölkopf, and A. J. Smola. "Kernel methods in machine learning". In: Annals of Statistics (2008).
[Hof99] Thomas Hofmann. "Probabilistic Latent Semantic Analysis". In: Proc. of Uncertainty in Artificial Intelligence, UAI'99. 1999, pp. 289–296.
[Hsu+09] D. Hsu, S. M. Kakade, and T. Zhang. "A Spectral Algorithm for Learning Hidden Markov Models". In: COLT. 2009.
[Hsu+12a] D. Hsu, S. M. Kakade, and T. Zhang. "A spectral algorithm for learning hidden Markov models". In: JCSS 78.5 (2012).
[Hsu+12b] Daniel Hsu, Sham M. Kakade, and Tong Zhang. "A spectral algorithm for learning hidden Markov models". In: Journal of Computer and System Sciences 78.5 (2012), pp. 1460–1480.
[Iof+15] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. ICML'15. Lille, France: JMLR.org, 2015, pp. 448–456.
[Ito92] Hisashi Ito. "An Algebraic Study of Discrete Stochastic Systems". Unpublished doctoral dissertation. Bunkyo-ku, Tokyo: University of Tokyo, 1992.
[Jae99] Herbert Jaeger. Characterizing Distributions of Stochastic Processes by Linear Operators. 1999.
[Jae00] Herbert Jaeger. "Observable Operator Models for Discrete Stochastic Time Series". In: Neural Computation 12.6 (June 2000), pp. 1371–1398.
[Jen06] J. Jensen. "Sur les fonctions convexes et les inégalités entre les valeurs moyennes". In: Acta Mathematica 30 (1906), pp. 175–193.
[Jor+99] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. "An Introduction to Variational Methods for Graphical Models". In: Machine Learning 37.2 (1999), pp. 183–233.
[Kae+98] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra. "Planning and Acting in Partially Observable Stochastic Domains". In: Artif. Intell. 101.1-2 (May 1998), pp. 99–134. ISSN: 0004-3702.
[Kal+11] M. Kalakrishnan, S. Chitta, E. Theodorou, P. Pastor, and S. Schaal. "Stochastic Trajectory Optimization for Motion Planning". In: IEEE International Conference on Robotics and Automation. 2011.
[Kan+08] Ravindran Kannan, Hadi Salmasian, and Santosh Vempala. "The Spectral Method for General Mixture Models". In: SIAM J. Comput. 38.3 (2008), pp. 1141–1156.
[Kap05] Hilbert J. Kappen. "Path integrals and symmetry breaking for optimal control theory". In: Journal of Statistical Mechanics: Theory and Experiment 2005.11 (2005), P11011.
[Kav+94] Lydia Kavraki, Petr Svestka, Jean-Claude Latombe, and Mark Overmars. Probabilistic Roadmaps for Path Planning in High-Dimensional Configuration Spaces. Tech. rep. Stanford, CA, USA, 1994.
[Kim+71] G. S. Kimeldorf and G. Wahba. "Some Results on Tchebycheffian Spline Functions". In: Journal of Mathematical Analysis and Applications. 1971.

[Kir12] D. E. Kirk. Optimal Control Theory: An Introduction. Dover Books on Electrical Engineering. Dover Publications, 2012. ISBN: 9780486135076.
[Kob+10] Jens Kober, Betty Mohler, and Jan Peters. "Imitation and Reinforcement Learning for Motor Primitives with Perceptual Coupling". In: From Motor Learning to Interaction Learning in Robots. Ed. by Olivier Sigaud and Jan Peters. Vol. 264. Studies in Computational Intelligence. Springer Verlag, 2010, pp. 209–225.
[Kob+09] Jens Kober and Jan Peters. "Policy Search for Motor Primitives in Robotics". In: Advances in Neural Information Processing Systems 22 (NIPS 2008). Cambridge, MA: MIT Press, 2009, pp. 849–856.
[Koh+04] Nate Kohl and Peter Stone. "Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion". In: Proceedings of the IEEE International Conference on Robotics and Automation. 2004.
[Kre78] E. Kreyszig. Introductory Functional Analysis with Applications. Krieger Publishing Company, 1978.
[Kru77] Joseph B. Kruskal. "Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics". In: Linear Algebra and its Applications 18.2 (1977), pp. 95–138. ISSN: 0024-3795.
[Kul+15a] Volodymyr Kuleshov, Arun Tejasvi Chaganty, and Percy Liang. "Simultaneous diagonalization: the asymmetric, low-rank, and noisy settings". In: CoRR abs/1501.06318 (2015).
[Kul+15b] Alex Kulesza, Nan Jiang, and Satinder Singh. "Spectral Learning of Predictive State Representations with Insufficient Statistics". In: AAAI Conference on Artificial Intelligence. 2015.
[Laf+01] J. Lafferty, A. McCallum, and F. Pereira. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data". In: Proc. of International Conference of Machine Learning. 2001.
[Lan+98] Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. "An Introduction to Latent Semantic Analysis". In: Discourse Processes 25 (1998), pp. 259–284.
[Lat06] Lieven De Lathauwer. "A Link between the Canonical Decomposition in Multilinear Algebra and Simultaneous Matrix Diagonalization". In: SIAM Journal on Matrix Analysis and Applications 28.3 (2006), pp. 642–666.
[Lat91] Jean-Claude Latombe. Robot Motion Planning. Norwell, MA, USA: Kluwer Academic Publishers, 1991. ISBN: 079239206X.
[Lau04] Adam Daniel Laud. Theory and Application of Reward Shaping in Reinforcement Learning. 2004.
[Lau+08] Fabien Lauer and Gérard Bloch. "Switched and PieceWise Nonlinear Hybrid System Identification". In: Hybrid Systems: Computation and Control: 11th International Workshop, HSCC 2008, St. Louis, MO, USA, April 22-24, 2008. Proceedings. Ed. by Magnus Egerstedt and Bud Mishra. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008, pp. 330–343. ISBN: 978-3-540-78929-1.
[Lav+00] Steven M. LaValle and James J. Kuffner, Jr. "Rapidly-Exploring Random Trees: Progress and Prospects". In: Algorithmic and Computational Robotics: New Directions. 2000, pp. 293–308.
[Lee+17] Gilwoo Lee, Zita Marinho, Aaron M. Johnson, Geoffrey J. Gordon, Siddhartha S. Srinivasa, and Matthew T. Mason. Unsupervised Learning for Nonlinear PieceWise Smooth Hybrid Systems. ArXiv preprint 1710.00440v1. 2017.

[Lev+15] Omer Levy, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings". In: Transactions of the Association for Computational Linguistics 3 (2015), pp. 211–225.
[Lia+03] Gang Liang and Bin Yu. "Maximum pseudo likelihood estimation in network tomography". In: IEEE Transactions on Signal Processing 51.8 (2003), pp. 2043–2053.
[Lin88] Bruce G. Lindsay. "Composite Likelihood Methods". In: Contemporary Mathematics 80 (1988), pp. 221–239.
[Lit+02] Michael Littman, Richard Sutton, and Satinder Singh. "Predictive Representations of State". In: Neural Information Processing Systems Foundation. 2002.
[Lit+01] Michael L. Littman, Richard S. Sutton, and Satinder Singh. "Predictive Representations of State". In: Advances in Neural Information Processing Systems 14. MIT Press, 2001, pp. 1555–1561.
[Liu+89] Dong Liu and Jorge Nocedal. "On the limited memory BFGS method for large scale optimization". In: Mathematical Programming 45 (1989), pp. 503–528.
[Lju99] Lennart Ljung. System Identification - Theory for the User. Prentice-Hall, 1999.
[LP81] Tomas Lozano-Perez. "Automatic planning of manipulator transfer movements". In: IEEE Transactions on Systems, Man, and Cybernetics 11 (1981), pp. 681–698.
[LP+79] Tomás Lozano-Pérez and Michael A. Wesley. "An Algorithm for Planning Collision-free Paths Among Polyhedral Obstacles". In: Commun. ACM 22.10 (Oct. 1979), pp. 560–570. ISSN: 0001-0782.
[Lui+12] Marco Lui and Timothy Baldwin. "langid.py: An Off-the-shelf Language Identification Tool". In: Proc. of Association of Computational Linguistics System Demonstrations. 2012, pp. 25–30.
[Luq+12] Franco M. Luque, Ariadna Quattoni, Borja Balle, and Xavier Carreras. "Spectral Learning for Non-deterministic Dependency Parsing". In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. EACL '12. Avignon, France: Association for Computational Linguistics, 2012, pp. 409–419. ISBN: 978-1-937284-19-0.
[Man+99] Christopher Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
[Man+15] Jonathan H. Manton and Pierre-Olivier Amblard. "A Primer on Reproducing Kernel Hilbert Spaces". In: Foundations and Trends in Signal Processing 8.1-2 (2015), pp. 1–126. ISSN: 1932-8346.
[Mar+16] Zita Marinho, André F. T. Martins, Shay B. Cohen, and Noah A. Smith. "Semi-Supervised Learning of Sequence Models with the Method of Moments". In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2016.
[Mat+05] T. Matsuzaki, Y. Miyao, and J. Tsujii. "Probabilistic CFG with latent annotations". In: Proc. of the Annual Meeting of the Association for Computational Linguistics. 2005, pp. 75–82.
[McC+00] Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. "Maximum Entropy Markov Models for Information Extraction and Segmentation". In: Proceedings of the Seventeenth International Conference on Machine Learning. ICML '00. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 591–598. ISBN: 1-55860-707-2.

[Mer94] Bernard Merialdo. "Tagging English Text with a Probabilistic Model". In: Computational Linguistics 20.2 (1994), pp. 155–171.
[Mic+05] Charles A. Micchelli and Massimiliano A. Pontil. "On Learning Vector-Valued Functions". In: Neural Comput. (2005).
[Mit+06] Noriaki Mitsunaga, Christian Smith, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. "Robot Behavior Adaptation for Human-Robot Interaction based on Policy Gradient Reinforcement Learning". In: Journal of the Robotics Society of Japan 24.7 (2006), pp. 820–829.
[Miy+96] H. Miyamoto, S. Schaal, F. Gandolfo, Y. Koike, R. Osu, E. Nakano, Y. Wada, and M. Kawato. "A Kendama learning robot based on bi-directional theory". In: Neural Networks 9.8 (1996), pp. 1281–1302.
[Moh09] Mehryar Mohri. "Weighted Automata Algorithms". In: Handbook of Weighted Automata. Ed. by Manfred Droste, Werner Kuich, and Heiko Vogler. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 213–254. ISBN: 978-3-642-01492-5.
[Moi+10] Ankur Moitra and Gregory Valiant. "Settling the Polynomial Learnability of Mixtures of Gaussians". In: 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA. 2010, pp. 93–102.
[Mol+06] G. Molenberghs and G. Verbeke. Models for Discrete Longitudinal Data. Springer Series in Statistics. Springer New York, 2006. ISBN: 9780387289809.
[Mos+06] Elchanan Mossel and Sebastien Roch. "Learning nonsingular phylogenies and Hidden Markov Models". In: Ann. Appl. Probab. 16 (2006), pp. 583–614.
[Nar+16] Karthik Narasimhan, Adam Yala, and Regina Barzilay. "Improving Information Extraction by Acquiring External Evidence with Reinforcement Learning". In: CoRR abs/1603.07954 (2016).
[Ng+04] Andrew Y. Ng and H. Jin Kim. "Stable adaptive control with online learning". In: NIPS. 2004, pp. 977–984.
[Ngu+15] Thang Nguyen, Jordan Boyd-Graber, Jeff Lund, Kevin Seppi, and Eric Ringger. "Is your anchor going up or down? Fast and accurate supervised topic models". In: Proc. of North American Association for Computational Linguistics. Denver, Colorado, 2015.
[O'C+10] Brendan O'Connor, Michel Krieger, and David Ahn. "TweetMotif: Exploratory Search and Topic Summarization for Twitter". In: ICWSM. 2010.
[Orr+98] Genevieve B. Orr and Klaus-Robert Mueller, eds. Neural Networks: Tricks of the Trade. Vol. 1524. Lecture Notes in Computer Science. Springer, 1998.
[Osb+16] Dominique Osborne, Shashi Narayan, and Shay B. Cohen. "Encoding Prior Knowledge with Eigenword Embeddings". In: Transactions of the Association for Computational Linguistics (2016).
[Owo+13] Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. "Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters". In: Proc. of North American Association for Computational Linguistics. 2013.
[Pan+95] J. Pan, L. Zhang, and D. Manocha. "Collision-free and Smooth Trajectory Computation in Cluttered Environments". In: International Journal of Robotics Research. 1995.
[Pao+07] Simone Paoletti, Aleksandar Lj. Juloski, Giancarlo Ferrari-Trecate, and René Vidal. "Identification of Hybrid Systems: A Tutorial". In: Eur. J. Control 13.2-3 (2007), pp. 242–260.

[Pap+98] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Dover Publications, 1998. [Par+14] A. Parikh, S. Cohen, and E. Xing. “Spectral Unsupervised Parsing with Additive Tree Metrics”. In: ACL. 2014. [Par+12a] Ankur Parikh, Le Song, Mariya Ishteva, Gabi Teodoru, and Eric P. Xing. “A Spectral Algorithm for Latent Junction Trees”. In: CoRR. 2012. [Par+12b] Ankur P. Parikh, Lee Song, Mariya Ishteva, Gabi Teodoru, and Eric P. Xing. “A spectral algo- rithm for latent junction trees”. In: Proc. of Uncertainty in Artificial Intelligence. 2012. [Par+11] Ankur P. Parikh, Le Song, and Eric P. Xing. “A Spectral Algorithm for Latent Tree Graphical Models”. In: ICML. Omnipress, 2011, pp. 1065–1072. [Par+12c] C. Park, J. Pan, and D. Manocha. “ITOMP: Incremental Trajectory Optimization for Real-Time Replanning in Dynamic Environments.” In: in Proc. of the International Conference on Automated Planning and Scheduling (ICAPS). 2012. [Par+70] E. Parzen, STANFORD UNIV CALIF Dept. of STATISTICS., and Department of Statistics Stan- ford University. Statistical Inference on Time Series by Rkhs Methods. Defense Technical Informa- tion Center, 1970. [Par62] Emanuel Parzen. “Extraction and Detection Problems and Reproducing Kernel Hilbert Spaces”. In: Journal of the Society for Industrial and Applied Mathematics Series A Control 1.1 (1962), pp. 35– 62. [Pau+17] Romain Paulus, Caiming Xiong, and Richard Socher. “A Deep Reinforced Model for Abstrac- tive Summarization”. In: CoRR abs/1705.04304 (2017). [Pea94] Karl Pearson. “III. Contributions to the mathematical theory of evolution”. In: Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 185 (1894), pp. 71–110. [Pet+08] J. Peters and S. Schaal. “Reinforcement Learning of Motor Skills with Policy Gradients”. In: Neural Networks 21.4 (2008), pp. 682–697. [Pet+11] Slav Petrov, Dipanjan Das, and Ryan McDonald. “A Universal Part-of-Speech Tagset”. In: ArXiv. 2011. [Pre97] Lutz Prechelt. “Early Stopping - but when?” In: Neural Networks: Tricks of the Trade, volume 1524 of LNCS, chapter 2. Springer-Verlag, 1997, pp. 55–69. [Pre+92] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes in C. Cambridge Uni- versity Press, 1992. [Qua+05] Ariadna Quattoni, Michael Collins, and Trevor Darrell. “Conditional Random Fields for Object Recognition”. In: Advances in Neural Information Processing Systems 17. Ed. by L. K. Saul, Y. Weiss, and L. Bottou. MIT Press, 2005, pp. 1097–1104. [Qua+14] Ariadna Quattoni, Borja Balle, Xavier Carreras, and Amir Globerson. “Spectral Regularization for Max-Margin Sequence Tagging”. In: Proc. of International Conference of Machine Learning. 2014, pp. 1710–1718. [Qui+93] S. Quinlan and O. Khatib. “Elastic Bands: Connecting Path Planning and Control.” In: IEEE International Conference on Robotics and Automation. 1993. BIBLIOGRAPHY 147

[Rab89] Lawrence R. Rabiner. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”. In: Proceedings of the IEEE 77.2 (1989), pp. 257–286. [Rah+08] Ali Rahimi and Benjamin Recht. “Random Features for Large-Scale Kernel Machines”. In: NIPS. 2008. [Rat+09] N. Ratliff, M. Zucker, J.A. Bagnell, and S. Srinivasa. “CHOMP: Gradient optimization tech- niques for efficient motion planning”. In: IEEE International Conference on Robotics and Automa- tion. 2009. [Rat+15] N. Ratliff, M. Toussaint, and S. Schaal. “Understanding the Geometry of Workspace Obstacles in Motion Optimization”. In: IEEE International Conference on Robotics and Automation. 2015. [Rat+07] N.D. Ratliff and J. A. Bagnell. “Kernel Conjugate Gradient for Fast Kernel Machines.” In: 2007. [Raw+13] K. Rawlik, M. Toussaint, and S. Vijayakumar. “Path Integral Control by Reproducing Kernel Hilbert Space Embedding”. In: International Joint Conference on Artificial Intelligence. IJCAI. 2013, pp. 1628–1634. [Red+84] Richard A. Redner and Homer F. Walker. “Mixture Densities, Maximum Likelihood and the EM Algorithm”. In: SIAM Review 26.2 (1984), pp. 195–239. [Rie28] F. Riesz. Sur la decomposition des operations fonctionnelles lineaires. 1928. [Rim+92] E. Rimon and D. E. Koditschek. “Exact robot navigation using artificial potential functions”. In: IEEE Transactions on Robotics and Automation 8.5 (1992), pp. 501–518. issn: 1042-296X. [Rob+10] John W. Roberts, Lionel Moret, Jun Zhang, and Russ Tedrake. “Motor Learning at Intermediate Reynolds Number: Experiments with Policy Gradient on the Flapping Flight of a Rigid Wing”. In: From Motor Learning to Interaction Learning in Robots. 2010, pp. 293–309. [Rod+13] Jordan Rodu, Dean P. Foster, Weichen Wu, and Lyle H. Ungar. “Using Regression for Spectral Estimation of HMMs”. In: Statistical Language and Speech Processing: First International Conference, SLSP 2013, Tarragona, Spain, July 29-31, 2013. Proceedings. Ed. by Adrian-Horia Dediu, Carlos Martín-Vide, Ruslan Mitkov, and Bianca Truthe. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 212–223. isbn: 978-3-642-39593-2. [Ros+04a] Matthew Rosencrantz, Geoffrey J. Gordon, and Sebastian Thrun. “Learning low dimensional predictive representations”. In: Machine Learning, Proceedings of the Twenty-first International Con- ference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004. 2004. [Ros+04b] Matthew Rosencrantz and Geoff Gordon. “Learning low dimensional predictive representa- tions”. In: ICML. 2004, pp. 695–702. [Ros+11] Stéphane Ross, Geoffrey J. Gordon, and Drew Bagnell. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning”. In: Proceedings of the Fourteenth Interna- tional Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011. 2011, pp. 627–635. [Ros+10] Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. “No-Regret Reductions for Imita- tion Learning and Structured Prediction”. In: CoRR abs/1011.0686 (2010). [Rub+04] Rubinstein and D. P. Kroese. The Cross-Entropy Method: A Unified Approach to Monte Carlo Simu- lation, Randomized Optimization and Machine Learning. Springer-Verlag, 2004. 148 kernel and moment based prediction and planning

[Rus+16] Andrei A. Rusu, Matej Vecerik, Thomas Rothörl, Nicolas Heess, Razvan Pascanu, and Raia Hadsell. “Sim-to-Real Robot Learning from Pixels with Progressive Nets”. In: CoRR abs/1610.04286 (2016). [Sal+78] Arto Salomaa and Matti Soittola. “Automata-Theoretic Aspects of Formal Power Series”. In: Texts and Monographs in Computer Science. 1978. [SA12] Yoni Halpern David Mimno Ankur Moitra David Sontag Yichen Wu Michael Zhu Sanjeev Arora Rong Ge. “A practical algorithm for topic modeling with provable guarantees”. In: (2012). [Sat+02] Masa-aki Sato, Yutaka Nakamura, and Shin Ishii. “Reinforcement Learning for Biped Loco- motion”. In: Artificial Neural Networks — ICANN 2002: International Conference Madrid, Spain, August 28–30, 2002 Proceedings. Ed. by José R. Dorronsoro. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 777–782. isbn: 978-3-540-46084-8. [Sch+02] B. Schölkopf and A. J. Smola. Learning with Kernels. Cambridge, MA: The MIT Press, 2002. [Sch+01] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001. [Sch+13] J. Schulman, J. Ho, A. Lee, I. Awwal, H. Bradlow, and P. Abbeel. “Finding Locally Optimal, Collision-Free Trajectories with Sequential Convex Optimization”. In: Robotics: Science and Sys- tems Conference. 2013. [Sch+15] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. “Trust Re- gion Policy Optimization”. In: Proceedings of the 32nd International Conference on Machine Learning (ICML). JMLR Workshop and Conference Proceedings, 2015. [Sch98] Hinrich Schütze. “Automatic word sense discrimination”. In: Computational Linguistics 24.1 (1998), pp. 97–123. [Sch61] M. P. Schützenberger. “On the definition of a family of automata”. In: Information and Control 4 (1961), pp. 245 –270. [Sha+14] Uri Shalit and Gal Chechik. “Coordinate-descent for Learning Orthogonal Matrices Through Givens Rotations”. In: Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32. ICML’14. Beijing, China: JMLR.org, 2014, pp. I–548–I–556. [ST+04] J. Shawe-Taylor and N. Cristianni. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004. [Sha+17] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer”. In: arXiv preprint arXiv:1701.06538 (2017). [Sid+07] Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. “A Constraint Generation Approach to Learning Stable Linear Dynamical Systems”. In: NIPS. 2007. [Sid+10a] Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. “Reduced-Rank Hidden Markov Models”. In: AISTATS. 2010. [Sid+10b] Sajid M. Siddiqi, Byron Boots, and Geoffrey J. Gordon. “Reduced-rank hidden markov models”. In: AISTATS (2010). BIBLIOGRAPHY 149

[Sin+03] S. Singh, M. Littman, N.K.Jong, D. Pardoe, and P. Stone. “Learning Predictive State Represen- tations”. In: ICML. 2003. [Sin+15] Sameer Singh, Tim Rocktäschel, and Sebastian Riedel. “Towards Combined Matrix and Ten- sor Factorization for Universal Schema Relation Extraction”. In: Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, VS@NAACL-HLT 2015, June 5, 2015, Denver, Colorado, USA. 2015, pp. 135–142. [Sin+04a] Satinder Singh, Michael R. James, and Matthew R. Rudary. “Predictive State Representations: A New Theory for Modeling Dynamical Systems”. In: UAI. 2004. [Sin+04b] Satinder P. Singh, Michael R. James, and Matthew R. Rudary. “Predictive State Representations: A New Theory for Modeling Dynamical Systems”. In: UAI. 2004, pp. 512–518. [Sin+16] Aman Sinha and John C Duchi. “Learning Kernels with Random Features”. In: Advances in Neural Information Processing Systems 29. Ed. by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett. Curran Associates, Inc., 2016, pp. 1298–1306. [Smi+05] Noah A. Smith and Jason Eisner. “Contrastive Estimation: Training Log-linear Models on Un- labeled Data”. In: Proc. of Association for Computational Linguistics. 2005, pp. 354–362. [Smo+07a] A. Smola, A. Gretton, and B. Schölkopf. “A Hilbert Space Embedding for Distributions”. In: International conference on Algorithmic Learning Theory. 2007, pp. 13–31. [Smo+07b] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. “A Hilbert space embedding for distributions”. In: In Algorithmic Learning Theory: 18th International Conference. 2007. [Son+18] H. Song, J. J. Thiagarajan, P. Sattigeri, and A. Spanias. “Optimizing Kernel Machines Using Deep Learning”. In: IEEE Transactions on Neural Networks and Learning Systems (2018), pp. 1–13. issn: 2162-237X. [Son+09] L. Song, J. Huang, A. Smola, and K. Fukumizu. “Hilbert space embeddings of conditional distributions”. In: International Conference of Machine Learning. 2009. [Son+10a] L. Song, B. Boots, S. M. Siddiqi, G. J. Gordon, and A. J. Smola. “Hilbert Space Embeddings of Hidden Markov Models”. In: ICML. 2010. [Son+10b] Le Song, Byron Boots, Sajid M. Siddiqi, Geoffrey J. Gordon, and Alexander J. Smola. “Hilbert Space Embeddings of Hidden Markov Models”. In: ICML. 2010, pp. 991–998. [Sou09] A. Souloumiac. “Joint diagonalization: Is non-orthogonal always preferable to orthogonal?” In: 2009 3rd IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Process- ing (CAMSAP). 2009, pp. 305–308. [Sri+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In: J. Mach. Learn. Res. 15.1 (Jan. 2014), pp. 1929–1958. issn: 1532-4435. [Str+15] Karl Stratos, Michael Collins, and Daniel Hsu. “Model-based word embeddings from decom- positions of count matrices”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Vol. 1. 2015, pp. 1282–1291. [Str+13] Karl Stratos, Alexander M. Rush, Shay B. Cohen, and Michael Collins. “Spectral learning of refinement HMMs”. In: Proc. of Computational Natural Language Learning. 2013. 150 kernel and moment based prediction and planning

[Str+16] Karl Stratos, michael Collins, and daniel J. Hsu. “Unsupervised Part-Of-Speech Tagging with Anchor Hidden Markov Models”. In: Transactions of the Association for Computational Linguistics 4 (2016), pp. 245–257. [Str+01] Malcolm J. A. Strens and Andrew W. Moore. “Direct Policy Search using Paired Statistical Tests”. In: ICML. Morgan Kaufmann, 2001, pp. 545–552. [Sun+16a] Wen Sun, Arun Venkatraman, Byron Boots, and J. Andrew Bagnell. “Learning to Filter with Predictive State Inference Machines”. In: Proceedings of the 2016 International Conference on Ma- chine Learning (ICML-2016). 2016. [Sun+16b] Wen Sun, Arun Venkatraman, Byron Boots, and J. Andrew Bagnell. “Learning to Filter with Predictive State Inference Machines”. In: Proceedings of the 33rd International Conference on Inter- national Conference on Machine Learning - Volume 48. ICML’16. New York, NY, USA: JMLR.org, 2016, pp. 1197–1205. [Sun14] Xu Sun. “Exact Decoding on Latent Variable Conditional Models is NP-Hard”. In: CoRR abs/1406.4682 (2014). [Sut+01] R. Sutton, D. Mcallester, S. Singh, and Y. Mansour. “Policy Gradient Methods for Reinforcement Learning with Function Approximation”. In: Advances in Neural Information Processing Systems 12 (Proceedings of the 1999 conference). MIT Press, 2001, pp. 1057–1063. [Sut+99] Richard S. Sutton, David A. Mcallester, Satinder P. Singh, and Yishay Mansour. “Policy Gradi- ent Methods for Reinforcement Learning with Function Approximation.” In: Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - Decem- ber 4, 1999]. Ed. by Sara A. Solla, Todd K. Leen, and Klaus R. Müller. The MIT Press, 1999, pp. 1057–1063. [Sut+98] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning : An Introduction. MIT Press, 1998. [Szi+06] I. Szita and A. Lörincz. “Learning Tetris Using the Noisy Cross-Entropy Method”. In: Neural Computation 18.12 (2006), pp. 2936–2941. [TB01] John T. Betts. “Practical Methods for Optimal Control Using Nonlinear Programming”. In: SIAM J. Comput. (Jan. 2001). [Tam+16] Aviv Tamar, Sergey Levine, Pieter Abbeel, Yi Wu, and Garrett Thomas. “Value Iteration Net- works”. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain. 2016, pp. 2146–2154. [Ted+05] Russ Tedrake, Teresa Weirui Zhang, and H Sebastian Seung. Learning to walk in 20 minutes. 2005. [Tei61] Henry Teicher. “Identifiability of Mixtures”. In: The Annals of Mathematical Statistics 32.1 (Mar. 1961), pp. 244–248. [Ter02] S. Terwijn. “On the Learnability of Hidden Markov Models”. In: International Colloquium on Grammatical Inference. 2002. [Tho+15] Michael R. Thon and Herbert Jaeger. “Links between multiplicity automata, observable oper- ator models and predictive state representations: a unified learning framework”. In: Journal of Machine Learning Research 16 (2015), pp. 103–147. BIBLIOGRAPHY 151

[Tit+85] D.M. Titterington, A.F.M. Smith, and U.E. Makov. Statistical Analysis of Finite Mixture Distribu- tions. Wiley, New York, 1985. [Tod+05] E. Todorov and W. Li. “A generalized iterative LQG method for locally-optimal feedback con- trol of constrained nonlinear stochastic systems”. In: in Proc. of the American Control Conference (ACC). 2005. [Tou+11] Marc Toussaint, Amos Storkey, and Stefan Harmeling. “Expectation-Maximization methods for solving (PO)MDPs”. In: Bayesian Time Series Models. Ed. by Silvia Chiappa David Barber A Taylan Cemgil. Cambridge University Press, 2011, pp. 388–413. [Upp97] D. R. Upper. “Theory and Algorithms for Hidden Markov Models and Generalized Hidden Markov Models”. Ph.D. Dissertation. Mathematics Department, University of California, Berke- ley, 1997. [VO+96] P. Van Overschee and B. De Moor. Subspace identification for linear systems : theory, implementation, applications. Kluwer Academic publ, 1996. [Var+11] Cristiano Varin, Nancy Reid, and David Firth. “An overview of composite likelihood methods”. In: Statistica Sinica 21.1 (2011), pp. 5–42. issn: 19968507. [Var+05] Cristiano Varin, Gudmund Høst, and Skare. “Pairwise Likelihood Inference in Spatial General- ized Linear Mixed Models”. In: Comput. Stat. Data Anal. 49.4 (June 2005), pp. 1173–1191. issn: 0167-9473. [Vav09] Stephen A. Vavasis. “On the Complexity of Nonnegative Matrix Factorization”. In: SIAM J. on Optimization 20.3 (Oct. 2009), pp. 1364–1377. issn: 1052-6234. [Vem+04] Santosh Vempala and Grant Wang. “A spectral algorithm for learning mixture models”. In: Journal of Computer and System Sciences 68.4 (2004). Special Issue on FOCS 2002, pp. 841 –860. issn: 0022-0000. [Ven+17] Arun Venkatraman, Nicholas Rhinehart, Wen Sun, Lerrel Pinto, Byron Boots, Kris Kitani, and James Andrew Bagnell. “Predictive State Decoders: Encoding the Future into Recurrent Net- works”. In: Proceedings of Advances in Neural Information Processing Systems (NIPS). 2017. [Vla+09] Nikos Vlassis and Marc Toussaint. “Model-free reinforcement learning as mixture learning”. In: Proceedings of the 26th Annual International Conference on Machine Learning. ACM. 2009, pp. 1081– 1088. [Wah99] G. Wahba. Advances in Kernel Methods. MIT Press, 1999. Chap. Support Vector Machines, Re- producing Kernel Hilbert Spaces, and Randomized GACV. [Wai+08] Martin J. Wainwright and Michael I. Jordan. “Graphical models, exponential families, and vari- ational inference”. In: Foundations and Trends in Machine Learning 1.2 (2008), pp. 1–305. [Wan+16] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. “Sample efficient actor-critic with experience replay”. In: arXiv preprint arXiv:1611.01224 (2016). [Wel03] Lloyd R. Welch. “Hidden Markov Models and the Baum-Welch Algorithm”. In: IEEE Informa- tion Theory Society Newsletter 53.4 (2003). [Wer90] Paul J. Werbos. “Backpropagation through time: what it does and how to do it”. In: Proceedings of the IEEE (1990). 152 kernel and moment based prediction and planning

[Wie+10] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber. “Recurrent Policy Gradients”. In: Logic Journal of the IGPL 18.5 (Oct. 2010), pp. 620–634. [Wil92] Ronald J. Williams. “Simple statistical gradient-following algorithms for connectionist rein- forcement learning”. In: Machine Learning. 1992, pp. 229–256. [Wil+16] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. “Deep Kernel Learning”. In: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. Ed. by Arthur Gretton and Christian C. Robert. Vol. 51. Proceedings of Machine Learning Research. Cadiz, Spain: PMLR, 2016, pp. 370–378. [Xan+16] Marios Xanthidis, Ioannis M. Rekleitis, and Jason M. O’Kane. “RRT+ : Fast Planning for High- Dimensional Configuration Spaces”. In: CoRR abs/1612.07333 (2016). [Yua+10] M. Yuan and T. Cai. “A reproducing kernel Hilbert space approach to functional linear regres- sion”. In: Annals of Statististics (2010). [Zha+95] J. Zhang and A. Knoll. “An Enhanced Optimization Approach for Generating Smooth Robot Trajectories in the Presence of Obstacles”. In: in Proc. of the European Chinese Automation Confer- ence. 1995. [Zho08] D. Zhou. “Derivative Reproducing properties for kernel methods in learning theory”. In: Journal of Computational and Applied Mathematics (2008). [Zho+14] Tianyi Zhou, Jeff A. Bilmes, and Carlos Guestrin. “Divide-and-Conquer Learning by Anchoring a Conical Hull”. In: Advances in Neural Information Processing Systems. 2014. [Zhu+99] Xiaojin Zhu, Stanley F. Chen, and Ronald Rosenfeld. “Linguistic features for whole sentence maximum entropy language models”. In: Sixth European Conference on Speech Communication and Technology, Eurospeech. 1999. [Zhu+11] Jinfeng Zhuang, Ivor W. Tsang, and Steven C.H. Hoi. “Two-Layer Multiple Kernel Learning”. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. Ed. by Geoffrey Gordon, David Dunson, and Miroslav Dudík. Vol. 15. Proceedings of Machine Learning Research. Fort Lauderdale, FL, USA: PMLR, 2011, pp. 909–917. [Zim+12] Hans-Georg Zimmermann, Christoph Tietz, and Ralph Grothmann. “Forecasting with Recur- rent Neural Networks: 12 Tricks”. In: Neural Networks: Tricks of the Trade: Second Edition. Ed. by Grégoire Montavon, Geneviève B. Orr, and Klaus-Robert Müller. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 687–707. isbn: 978-3-642-35289-8. [Zuc+13] M. Zucker, N. Ratliff, A. Dragan, M. Pivtoraiko, M. Klingensmith, C. Dellin, J. A. Bagnell, and S. Srinivasa. “CHOMP: Covariant Hamiltonian Optimization for Motion Planning”. In: International Journal Robotics Research. 2013.