
Probabilistic machine learning: foundations and frontiers

Zoubin Ghahramani
University of Cambridge and Uber AI Labs
[email protected]
http://mlg.eng.cam.ac.uk/zoubin/
https://www.uber.com/info/ailabs/

MLSS 2017, Tübingen

An exciting time for Machine Learning!

APPLICATIONS OF MACHINE LEARNING

- Speech and Language Technologies: automatic speech recognition, machine translation, question-answering, dialog systems
- Computer Vision: object, face and handwriting recognition, image captioning
- Scientific Data Analysis (e.g. bioinformatics, astronomy)
- Recommender Systems
- Self-driving cars
- Financial Prediction and Automated Trading

COMPUTER GAMES

DeepMind's system learning to play ATARI games at super-human level.

ML: MANY METHODS WITH MANY LINKS TO OTHER FIELDS

- Kernel Methods, SVMs, Gaussian Processes
- Reinforcement Learning, Control Theory, Decision Making
- Symbolic Systems, Logic-based (ILP) and Relational Learning
- Probabilistic Models, Bayesian Methods
- Neural Networks, Deep Learning
- Graphical Models, Causal Inference
- Decision Trees, Random Forests
- Unsupervised Learning: Feature Discovery, Clustering, Dimensionality Reduction
- Linear Statistical Models, Logistic Regression

Neural networks and deep learning

NEURAL NETWORKS

Neural networks are tunable nonlinear functions with many parameters. The parameters \theta are the weights of the neural net.

[Figure: a feedforward network, with inputs x at the bottom, a layer of hidden units, and outputs y at the top, connected by layers of weights.]

Neural nets model p(y^{(n)} \mid x^{(n)}, \theta) as a nonlinear function of \theta and x, e.g.:

p(y^{(n)} = 1 \mid x^{(n)}, \theta) = \sigma\Big( \sum_i \theta_i x_i^{(n)} \Big)

Multilayer neural networks model the overall function as a composition of functions (layers), e.g.:

y^{(n)} = \sum_j \theta_j^{(2)} \, \sigma\Big( \sum_i \theta_{ji}^{(1)} x_i^{(n)} \Big) + \epsilon^{(n)}

They are usually trained to maximise the likelihood (or penalised likelihood) using variants of stochastic gradient descent (SGD) optimisation.

NN = nonlinear function + basic stats + basic optimisation
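To make the two formulas above concrete, here is a minimal sketch of a one-hidden-layer network of exactly that form, trained by maximising the log likelihood with plain gradient descent in NumPy. This is not code from the talk: the toy data, the network size, the learning rate, and the addition of bias terms are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification data (illustrative only): the label depends
# nonlinearly on the inputs, so a linear model would not suffice.
N, D, H = 200, 2, 16
X = rng.normal(size=(N, D))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters theta = the weights (and biases) of the network.
W1 = rng.normal(scale=0.5, size=(D, H)); b1 = np.zeros(H)
w2 = rng.normal(scale=0.5, size=H);      b2 = 0.0

lr = 1.0
for step in range(5000):
    # Forward pass: p(y=1 | x, theta) = sigma( w2 . sigma(W1^T x + b1) + b2 )
    h = sigmoid(X @ W1 + b1)          # hidden activations, shape (N, H)
    p = sigmoid(h @ w2 + b2)          # predicted probabilities, shape (N,)

    # Negative log likelihood (cross-entropy): the objective being minimised.
    nll = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

    # Backward pass: gradients of the NLL with respect to the parameters.
    err = (p - y) / N                 # dNLL / dlogit
    grad_w2 = h.T @ err
    grad_b2 = err.sum()
    back = np.outer(err, w2) * h * (1.0 - h)
    grad_W1 = X.T @ back
    grad_b1 = back.sum(axis=0)

    # (Full-batch) gradient step; real systems use stochastic mini-batches.
    w2 -= lr * grad_w2; b2 -= lr * grad_b2
    W1 -= lr * grad_W1; b1 -= lr * grad_b1

print(f"final NLL {nll:.3f}, train accuracy {np.mean((p > 0.5) == y):.2f}")

In practice one would use mini-batch SGD and an automatic-differentiation framework rather than hand-written gradients, but the structure (nonlinear function, likelihood, gradient-based optimisation) is the same.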
DEEP LEARNING

Deep learning systems are neural network models similar to those popular in the '80s and '90s, with:

- some architectural and algorithmic innovations (e.g. many layers, ReLUs, dropout, LSTMs)
- vastly larger data sets (web-scale)
- vastly larger-scale compute resources (GPU, cloud)
- much better software tools (Theano, Torch, TensorFlow)
- vastly increased industry investment and media hype

[Figure from http://www.andreykurenkov.com/]

LIMITATIONS OF DEEP LEARNING

Neural networks and deep learning systems give amazing performance on many benchmark tasks, but they are generally:

- very data hungry (e.g. often millions of examples)
- very compute-intensive to train and deploy (cloud GPU resources)
- poor at representing uncertainty
- easily fooled by adversarial examples
- finicky to optimise: non-convex, and the choice of architecture, learning procedure, initialisation, etc. requires expert knowledge and experimentation
- uninterpretable black-boxes, lacking in transparency, difficult to trust

Beyond deep learning

TOOLBOX VS MODELLING VIEWS OF MACHINE LEARNING

- Machine Learning is a toolbox of methods for processing data: feed the data into one of many possible methods; choose methods that have good theoretical or empirical performance; make predictions and decisions.
- Machine Learning is the science of learning models from data: define a space of possible models; learn the parameters and structure of the models from data; make predictions and decisions.

MACHINE LEARNING AS PROBABILISTIC MODELLING

- A model describes data that one could observe from a system.
- If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...
- ...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.

BAYES RULE

P(\mathrm{hypothesis} \mid \mathrm{data}) = \frac{P(\mathrm{hypothesis}) \, P(\mathrm{data} \mid \mathrm{hypothesis})}{\sum_h P(h) \, P(\mathrm{data} \mid h)}

- Bayes rule tells us how to do inference about hypotheses (uncertain quantities) from data (measured quantities).
- Learning and prediction can be seen as forms of inference.

Reverend Thomas Bayes (1702-1761)

ONE SLIDE ON BAYESIAN MACHINE LEARNING

Everything follows from two simple rules:

Sum rule:      P(x) = \sum_y P(x, y)
Product rule:  P(x, y) = P(x) \, P(y \mid x)

Learning:

P(\theta \mid \mathcal{D}, m) = \frac{P(\mathcal{D} \mid \theta, m) \, P(\theta \mid m)}{P(\mathcal{D} \mid m)}

where P(\mathcal{D} \mid \theta, m) is the likelihood of parameters \theta in model m, P(\theta \mid m) is the prior probability of \theta, and P(\theta \mid \mathcal{D}, m) is the posterior of \theta given data \mathcal{D}.

Prediction:

P(x \mid \mathcal{D}, m) = \int P(x \mid \theta, \mathcal{D}, m) \, P(\theta \mid \mathcal{D}, m) \, d\theta

Model Comparison:

P(m \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid m) \, P(m)}{P(\mathcal{D})}

WHY SHOULD WE CARE?

- Calibrated model and prediction uncertainty: getting systems that know when they don't know.
- Automatic model complexity control and structure learning (Bayesian Occam's Razor).

WHAT DO I MEAN BY BEING BAYESIAN?

Let's return to the example of neural networks / deep learning:

- dealing with all sources of parameter uncertainty
- also potentially dealing with structure uncertainty

Feedforward neural nets model p(y^{(n)} \mid x^{(n)}, \theta). The parameters \theta are the weights of the neural net. The structure is the choice of architecture, number of hidden units and layers, choice of activation functions, etc.

A NOTE ON MODELS VS ALGORITHMS

In early NIPS there was an "Algorithms and Architectures" track.

Models:                  Algorithms:
convnets                 Stochastic Gradient Descent
LDA                      Conjugate-gradients
RNNs                     MCMC
HMMs                     Variational Bayes and SVI
Boltzmann machines       SGLD
State-space models       Belief propagation, EP
Gaussian processes       ...
LSTMs
...

There are algorithms that target finding a parameter optimum \theta^* and algorithms that target inferring the posterior p(\theta \mid \mathcal{D}). Often these are not so different.

Let's be clear: "Bayesian" belongs in the Algorithms category, not the Models category. Any well defined model can be treated in a Bayesian manner.
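As a tiny illustration of the Learning and Prediction equations above, and of the claim that any well defined model can be treated in a Bayesian manner, the sketch below applies them to Bayesian linear regression, where both the posterior over \theta and the predictive integral are available in closed form. The prior precision, noise level, and toy data are illustrative assumptions, not values from the talk.

import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D data from a noisy linear function (illustrative only).
N = 20
x = rng.uniform(-1, 1, size=N)
y = 0.5 * x - 0.3 + 0.1 * rng.normal(size=N)

# Model m: y = theta^T phi(x) + eps, with phi(x) = [1, x],
#   prior  P(theta | m)        = N(0, alpha^{-1} I)
#   noise  P(y | x, theta, m)  = N(theta^T phi(x), sigma^2)
alpha, sigma2 = 2.0, 0.1 ** 2
Phi = np.column_stack([np.ones(N), x])

# Learning: the posterior P(theta | D, m) is Gaussian with
#   cov  = (alpha I + Phi^T Phi / sigma^2)^{-1},  mean = cov Phi^T y / sigma^2
post_cov = np.linalg.inv(alpha * np.eye(2) + Phi.T @ Phi / sigma2)
post_mean = post_cov @ Phi.T @ y / sigma2

# Prediction: P(y* | x*, D, m) = \int P(y* | x*, theta) P(theta | D, m) dtheta,
# again Gaussian, with variance that grows away from the observed data.
x_star = np.array([-2.0, 0.0, 2.0])
Phi_star = np.column_stack([np.ones_like(x_star), x_star])
pred_mean = Phi_star @ post_mean
pred_var = sigma2 + np.einsum('ij,jk,ik->i', Phi_star, post_cov, Phi_star)

for xs, m, v in zip(x_star, pred_mean, pred_var):
    print(f"x*={xs:+.1f}: predictive mean {m:+.3f}, std {np.sqrt(v):.3f}")

The predictive standard deviation grows for test inputs far from the training data, which is exactly the "knowing when you don't know" behaviour argued for above.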
BAYESIAN DEEP LEARNING

Bayesian deep learning can be implemented in many ways:

- Laplace approximations (MacKay, 1992)
- variational approximations (Hinton and van Camp, 1993; Graves, 2011)
- MCMC (Neal, 1993)
- stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011)
- probabilistic back-propagation (Hernandez-Lobato et al., 2015, 2016)
- dropout as Bayesian averaging (Gal and Ghahramani, 2015, 2016)

[Figure from Yarin Gal's thesis "Uncertainty in Deep Learning" (2016)]

See also the NIPS 2016 workshop on Bayesian Deep Learning.

When do we need probabilities?

WHEN IS THE PROBABILISTIC APPROACH ESSENTIAL?

Many aspects of learning and intelligence depend crucially on the careful probabilistic representation of uncertainty:

- Forecasting
- Decision making
- Learning from limited, noisy, and missing data
- Learning complex personalised models
- Data compression
- Automating modelling, discovery, and experiment design

WHY DOES UBER CARE?

The same aspects of learning and intelligence that depend on the careful probabilistic representation of uncertainty (forecasting, decision making, learning from limited, noisy, and missing data, learning complex personalised models, data compression, and automating modelling, discovery, and experiment design) arise throughout Uber's business: supply (driver) and demand (rider) prediction, ETA prediction, pricing, business intelligence, modelling traffic and cities, mapping, self-driving cars, customer service, UberEATS personalization...

[Poster: "Automatic construction and description of nonparametric models", James Robert Lloyd (1), David Duvenaud (1), Roger Grosse (2), Joshua B. Tenenbaum (2), Zoubin Ghahramani (1). 1: Department of Engineering, University of Cambridge, UK; 2: Massachusetts Institute of Technology, USA.]

Modelling structure through Gaussian process kernels:

- The kernel specifies which structures are likely under the GP prior, which determines the generalisation properties of the model. Base kernels include the Squared Exponential (SE), Periodic (Per) and Linear (Lin) kernels.

Automatically describing model properties. How to automatically describe arbitrarily complex kernels:

- The kernel is distributed into a sum of products.
- Sums of kernels are sums of functions, so each product is described separately.
- Each kernel in a product modifies the model in a consistent way...

[Remaining poster panels: an automatically generated example analysis with model-checking statistics for each component (ACF, periodogram, QQ); a figure on Scalable Variational Gaussian Process Classification.]
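To make the idea of composing kernels concrete, here is a small sketch of GP regression with a composite kernel built from the base kernels named above: SE plus the product of Per and Lin (a smooth trend plus a periodic component whose amplitude grows linearly). The particular composition, the hyperparameters, and the toy data are illustrative assumptions; the Automatic Statistician work searches over such compositions rather than fixing one by hand.

import numpy as np

# Base kernels on 1-D inputs (hyperparameters are illustrative choices).
def se(a, b, ell=1.0, sf=1.0):
    return sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def per(a, b, period=1.0, ell=0.7, sf=1.0):
    d = np.pi * np.abs(a[:, None] - b[None, :]) / period
    return sf**2 * np.exp(-2.0 * np.sin(d)**2 / ell**2)

def lin(a, b, sf=0.3):
    return sf**2 * a[:, None] * b[None, :]

# A composite kernel expressed as a sum of products: SE + Per * Lin.
def kernel(a, b):
    return se(a, b) + per(a, b) * lin(a, b)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 5, size=40))
y = 0.5 * x + 0.2 * x * np.sin(2 * np.pi * x) + 0.05 * rng.normal(size=40)

noise = 0.05**2
x_star = np.linspace(0, 6, 7)

# Standard GP regression equations: posterior mean and variance at x_star.
K = kernel(x, x) + noise * np.eye(len(x))
K_s = kernel(x, x_star)
K_ss = kernel(x_star, x_star)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
mean = K_s.T @ alpha
v = np.linalg.solve(L, K_s)
var = np.diag(K_ss) - np.sum(v**2, axis=0) + noise

for xs, m, s in zip(x_star, mean, np.sqrt(var)):
    print(f"x*={xs:.1f}: mean {m:+.2f} +/- {2*s:.2f}")

Because sums of kernels correspond to sums of functions, each product term in such a composite kernel can be described separately, which is what makes the automatically generated natural-language descriptions possible.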