Gaussian Processes and Relevance Vector Machines
Learning with Uncertainty – Gaussian Processes and Relevance Vector Machines

Joaquin Quiñonero Candela

Kongens Lyngby 2004
IMM-PHD-2004-135

Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kongens Lyngby, Denmark
Phone +45 45253351, Fax +45 45882673
[email protected]
www.imm.dtu.dk
IMM-PHD: ISSN 0909-3192

Synopsis

This thesis is concerned with Gaussian Processes (GPs) and Relevance Vector Machines (RVMs), both of which are special cases of probabilistic linear models. We view both models from a Bayesian perspective and are forced to adopt an approximate Bayesian treatment of learning, for two reasons. First, because the full Bayesian solution is analytically intractable and because we do not, on principle, want to use sampling-based methods. Second, and supporting the requirement of not sampling, we want computationally efficient models. Computational savings are obtained through sparsity: sparse models have a large number of their parameters set to zero. For the RVM, which we treat in Chapter 2, we show that it is the specific choice of Bayesian approximation that results in sparsity. Probabilistic models have the important property that predictive distributions can be computed instead of point predictions. We also show that the resulting sparse probabilistic models imply counterintuitive prior distributions over functions, and ultimately inadequate predictive variances; the models become more certain about their predictions the further one moves away from the training data. We propose the RVM*, a modified RVM that produces significantly better predictive uncertainties. RVMs are a special class of GPs; the latter give better results and are non-sparse, non-parametric models. For completeness, in Chapter 3 we study a specific family of approximations, Reduced Rank Gaussian Processes (RRGPs), which take the form of finite extended linear models. We show that Gaussian Processes are in general equivalent to infinite extended linear models. We also show that RRGPs, like RVMs, suffer from inadequate predictive variances. This problem is solved by modifying the classic RRGP method analogously to the RVM*. In the last part of the thesis we move on to problems with uncertain inputs. Until now these have been assumed deterministic, as is customary. Here we derive equations for prediction at stochastic inputs with GPs and RVMs and use them to propagate uncertainty recursively multiple steps ahead for time-series predictions. This allows us to compute sensible uncertainties in recursive k-step-ahead prediction in cases where standard methods, which ignore the accumulating uncertainty, wildly overestimate their confidence. Finally, a much harder problem is investigated: training with uncertain inputs. We examine the full Bayesian solution, which involves an intractable integral. We propose two preliminary solutions. The first attempts to "guess" the unknown "true" inputs, and requires carefully tuned optimisation to avoid over-fitting. It also requires prior knowledge of the output noise, which is a limitation. The second method relies on sampling from the posterior distribution of the inputs and optimising the hyperparameters. Sampling has the side effect of greatly increasing the computational burden, which again is a limitation. However, success on toy examples is encouraging and should stimulate future research.
Summary

This thesis is concerned with Gaussian Processes (GPs) and Relevance Vector Machines (RVMs), both of which are particular instances of probabilistic linear models. We look at both models from a Bayesian perspective, and are forced to adopt an approximate Bayesian treatment of learning for two reasons. The first reason is the analytical intractability of the full Bayesian treatment and the fact that we in principle do not want to resort to sampling methods. The second reason, which incidentally justifies our not wanting to sample, is that we are interested in computationally efficient models. Computational efficiency is obtained through sparseness: sparse linear models have a significant number of their weights set to zero. For the RVM, which we treat in Chap. 2, we show that it is precisely the particular choice of Bayesian approximation that enforces sparseness. Probabilistic models have the important property of producing predictive distributions instead of point predictions. We also show that the resulting sparse probabilistic model implies counterintuitive priors over functions, and ultimately inappropriate predictive variances; the model is more certain about its predictions the further away it gets from the training data. We propose the RVM*, a modified RVM that provides significantly better predictive uncertainties. RVMs happen to be a particular case of GPs, the latter having superior performance and being non-sparse non-parametric models. For completeness, in Chap. 3 we study a particular family of approximations to Gaussian Processes, Reduced Rank Gaussian Processes (RRGPs), which take the form of finite extended linear models; we show that GPs are in general equivalent to infinite extended linear models. We also show that RRGPs result in degenerate GPs, which suffer, like RVMs, from inappropriate predictive variances. We solve this problem by proposing a modification of the classic RRGP approach, in the same guise as the RVM*. In the last part of this thesis we move on to the problem of uncertainty in the inputs. Indeed, these were until now considered deterministic, as is common practice. We derive the equations for predicting at an uncertain input with GPs and RVMs, and use this to propagate the uncertainty in recursive multi-step ahead time-series predictions. This allows us to obtain sensible predictive uncertainties when recursively predicting k steps ahead, while standard approaches that ignore the accumulated uncertainty are way overconfident. Finally we explore a much harder problem: that of training with uncertain inputs. We explore approximating the full Bayesian treatment, which implies an analytically intractable integral. We propose two preliminary approaches. The first one tries to "guess" the unknown "true" inputs, and requires careful optimisation to avoid over-fitting. It also requires prior knowledge of the output noise, which is limiting. The second approach consists in sampling from the posterior of the inputs, and optimising the hyperparameters. Sampling has the effect of severely increasing the computational cost, which again is limiting. However, the success in toy experiments is exciting, and should motivate future research.
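The point about predictive variances can be made concrete with a small numerical sketch. The Python snippet below is purely illustrative and not part of the thesis: the squared-exponential kernel, the toy data, the choice of basis centres and the unit prior on the weights are assumptions made for this example only. It compares the predictive variance of a full GP with that of a degenerate finite linear model built from a few localised basis functions (an RVM-like construction): far from the training data the GP variance reverts to the prior variance plus noise, whereas the degenerate model's variance collapses to the noise level, i.e. the model becomes unreasonably confident.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel, also used as localised basis function."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=20)                 # toy training inputs
y = np.sin(X) + 0.1 * rng.standard_normal(20)   # toy noisy targets (unused for variances)
noise = 0.1 ** 2                                 # assumed output noise variance

def gp_var(xs, X, noise):
    """Full GP predictive variance: k** + noise - k*^T (K + noise I)^-1 k*."""
    K = rbf(X, X) + noise * np.eye(len(X))
    k_star = rbf(xs, X)
    return 1.0 + noise - np.sum(k_star @ np.linalg.inv(K) * k_star, axis=1)

def degenerate_var(xs, X, centres, noise):
    """Finite linear model with unit prior on the weights (RVM-like)."""
    Phi = rbf(X, centres)                              # N x M design matrix
    A = Phi.T @ Phi / noise + np.eye(len(centres))     # posterior precision of the weights
    Sigma = np.linalg.inv(A)                           # posterior covariance of the weights
    phi_star = rbf(xs, centres)
    return noise + np.sum(phi_star @ Sigma * phi_star, axis=1)

far = np.array([10.0])                           # a test input far from all training data
print("GP variance far away:        ", gp_var(far, X, noise))              # ~ 1 + noise
print("degenerate variance far away:", degenerate_var(far, X, X[:5], noise))  # ~ noise only
```

The RVM* of Chap. 2 and the modified RRGP of Chap. 3 are aimed precisely at removing this collapse of the predictive variance away from the data.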
Preface

This thesis was prepared partly at Informatics and Mathematical Modelling at the Technical University of Denmark, partly at the Department of Computer Science at the University of Toronto, and partly at the Max Planck Institute for Biological Cybernetics in Tübingen, Germany, in partial fulfilment of the requirements for acquiring the Ph.D. degree in engineering.

The thesis deals with probabilistic extended linear models for regression, under approximate Bayesian learning schemes. In particular the Relevance Vector Machine and Gaussian Processes are studied. One focus is guaranteeing computational efficiency while at the same time implying appropriate priors over functions. The other focus is to deal with uncertainty in the inputs, both at test and at training time.

The thesis consists of a summary report and a collection of five research papers written during the period 2001–2004 and published elsewhere.

Tübingen, May 2004
Joaquin Quiñonero Candela

Papers included in the thesis

[B] Joaquin Quiñonero-Candela and Lars Kai Hansen. (2002). Time Series Prediction Based on the Relevance Vector Machine with Adaptive Kernels. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2002, volume 1, pages 985–988, Piscataway, New Jersey. IEEE. This paper was awarded an oral presentation.

[C] Joaquin Quiñonero-Candela and Ole Winther. (2003). Incremental Gaussian Processes. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 1001–1008, Cambridge, Massachusetts. MIT Press.

[D] Agathe Girard, Carl Edward Rasmussen, Joaquin Quiñonero-Candela and Roderick Murray-Smith. (2003). Gaussian Process with Uncertain Inputs - Application to Multiple-Step Ahead Time-Series Forecasting. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 529–536, Cambridge, Massachusetts. MIT Press. This paper was awarded an oral presentation.

[E] Joaquin Quiñonero-Candela, Agathe Girard, Jan Larsen and Carl E. Rasmussen. (2003). Propagation of Uncertainty in Bayesian Kernel Models - Application to Multiple-Step Ahead Forecasting. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 2, pages 701–704, Piscataway, New Jersey. IEEE. This paper was awarded an oral presentation.

[F] Fabian Sinz, Joaquin Quiñonero-Candela, Goekhan Bakır, Carl Edward Rasmussen and Matthias O. Franz. (2004). Learning Depth from Stereo. To appear in Deutsche Arbeitsgemeinschaft für Mustererkennung (DAGM) Pattern Recognition Symposium 26, Heidelberg, Germany. Springer.

Acknowledgements

Before anything, I would like to thank my advisors, Lars Kai Hansen and Carl Edward Rasmussen, for having given me the opportunity to do this PhD. I have received great support and at the same time been given large freedom to conduct