Probabilistic Machine Learning: Foundations and Frontiers
Zoubin Ghahramani
University of Cambridge, Alan Turing Institute, Leverhulme Centre for the Future of Intelligence, Uber AI Labs
[email protected]
http://mlg.eng.cam.ac.uk/zoubin/
Sackler Forum, National Academy of Sciences, Washington DC, 2017

Machine Learning

Many Related Terms
Statistical Modelling, Artificial Intelligence, Neural Networks, Machine Learning, Data Mining, Data Analytics, Deep Learning, Pattern Recognition, Data Science

Many Related Fields
Engineering, Computer Science, Statistics, Machine Learning, Computational Neuroscience, Applied Mathematics, Economics, Cognitive Science, Physics

Many Many Applications
Bioinformatics, Computer Vision, Robotics, Scientific Data Analysis, Natural Language Processing, Information Retrieval, Speech Recognition, Recommender Systems, Signal Processing, Machine Translation, Medical Informatics, Targeted Advertising, Finance, Data Compression

Machine Learning
• Machine learning is an interdisciplinary field that develops both the mathematical foundations and the practical applications of systems that learn from data.
Main conferences and journals: NIPS, ICML, AISTATS, UAI, KDD, JMLR, IEEE TPAMI

Canonical problems in machine learning

Classification
[figure: scatter plot of two labelled classes, x and o]
• Task: predict a discrete class label from input data
• Applications: face recognition, image recognition, medical diagnosis, ...
• Methods: Logistic Regression, Support Vector Machines (SVMs), Neural Networks, Random Forests, Gaussian Process Classifiers, ...

Regression
[figure: data points y plotted against x with a fitted curve]
• Task: predict continuous quantities from input data
• Applications: financial forecasting, click-rate prediction, ...
• Methods: Linear Regression, Neural Networks, Gaussian Processes, ...

Clustering
[figure: scatter plot of points grouped into clusters]
• Task: group data together so that similar points are in the same group
• Applications: bioinformatics, astronomy, document modelling, network modelling, ...
• Methods: k-means, Gaussian mixtures, Dirichlet process mixtures, ...

Dimensionality Reduction
[figure: high-dimensional data lying on a low-dimensional manifold]
• Task: map high-dimensional data onto low dimensions while preserving relevant information
• Applications: anywhere the raw data is high-dimensional
• Methods: PCA, factor analysis, MDS, LLE, Isomap, GPLVM, ...

Semi-supervised Learning
[figure: scatter plot mixing a few labelled points with many unlabelled ones]
• Task: learn from both labelled and unlabelled data
• Applications: anywhere labelling data is expensive, e.g. vision, speech, ...
• Methods: probabilistic models, graph-based SSL, transductive SVMs, ...

Computer Vision: Object, Face and Handwriting Recognition, Image Captioning
Computer Games
Autonomous Vehicles

Neural networks and deep learning

NEURAL NETWORKS
[figure: network diagram with inputs x, weights, hidden units, weights, outputs y]
Data: $\mathcal{D} = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N} = (X, \mathbf{y})$
Parameters $\theta$ are the weights of the neural net.
Neural nets model $p(y^{(n)} \mid x^{(n)}, \theta)$ as a nonlinear function of $\theta$ and $x$, e.g.:
$$p(y^{(n)} = 1 \mid x^{(n)}, \theta) = \sigma\Big(\sum_i \theta_i x_i^{(n)}\Big)$$
Multilayer neural networks model the overall function as a composition of functions (layers), e.g.:
$$y^{(n)} = \sum_j \theta_j^{(2)} \, \sigma\Big(\sum_i \theta_{ji}^{(1)} x_i^{(n)}\Big) + \epsilon^{(n)}$$
Usually trained to maximise the likelihood (or penalised likelihood) using variants of stochastic gradient descent (SGD) optimisation.
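To make the last point concrete, here is a minimal numpy sketch (my illustration, not from the talk; the architecture, toy data, and hyperparameters are all invented for the example) of a one-hidden-layer network of the form above, trained by SGD on a penalised Bernoulli log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, w2, b2):
    h = sigmoid(W1 @ x + b1)      # hidden units: sigma(sum_i theta_ji x_i)
    p = sigmoid(w2 @ h + b2)      # p(y = 1 | x, theta)
    return h, p

def sgd_step(x, y, params, lr=0.5, lam=1e-4):
    """One SGD update on the penalised negative log-likelihood."""
    W1, b1, w2, b2 = params
    h, p = forward(x, W1, b1, w2, b2)
    err = p - y                                   # d(-log lik)/d(output logit)
    delta = err * w2 * h * (1 - h)                # backprop through hidden layer
    return (W1 - lr * (np.outer(delta, x) + lam * W1),
            b1 - lr * delta,
            w2 - lr * (err * h + lam * w2),
            b2 - lr * err)

# Toy data: label is 1 iff the two inputs have the same sign (XOR-like).
X = rng.standard_normal((500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)

params = (0.5 * rng.standard_normal((8, 2)), np.zeros(8),
          0.5 * rng.standard_normal(8), 0.0)
for epoch in range(100):
    for n in rng.permutation(len(X)):             # stochastic: one example at a time
        params = sgd_step(X[n], y[n], params)

preds = np.array([forward(x, *params)[1] > 0.5 for x in X])
print(f"training accuracy: {np.mean(preds == y):.2f}")
```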
DEEP LEARNING
Deep learning systems are neural network models similar to those popular in the ’80s and ’90s, with:
◮ some architectural and algorithmic innovations (e.g. many layers, ReLUs, dropout, LSTMs)
◮ vastly larger data sets (web-scale)
◮ vastly larger-scale compute resources (GPU, cloud)
◮ much better software tools (Theano, Torch, TensorFlow)
◮ vastly increased industry investment and media hype
[figure from http://www.andreykurenkov.com/]

LIMITATIONS OF DEEP LEARNING
Neural networks and deep learning systems give amazing performance on many benchmark tasks, but they are generally:
◮ very data hungry (e.g. often millions of examples)
◮ very compute-intensive to train and deploy (cloud GPU resources)
◮ poor at representing uncertainty
◮ easily fooled by adversarial examples
◮ finicky to optimise: non-convex, and the choice of architecture, learning procedure, and initialisation requires expert knowledge and experimentation
◮ uninterpretable black-boxes, lacking in transparency, difficult to trust

Beyond deep learning

MACHINE LEARNING AS PROBABILISTIC MODELLING
◮ A model describes data that one could observe from a system.
◮ If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...
◮ ...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions, and learn from data.

BAYES RULE
$$P(\text{hypothesis} \mid \text{data}) = \frac{P(\text{hypothesis})\, P(\text{data} \mid \text{hypothesis})}{\sum_h P(h)\, P(\text{data} \mid h)}$$
◮ Bayes rule tells us how to do inference about hypotheses (uncertain quantities) from data (measured quantities).
◮ Learning and prediction can be seen as forms of inference.
Reverend Thomas Bayes (1702-1761)

ONE SLIDE ON BAYESIAN MACHINE LEARNING
Everything follows from two simple rules:
Sum rule: $P(x) = \sum_y P(x, y)$
Product rule: $P(x, y) = P(x)\, P(y \mid x)$
Learning:
$$P(\theta \mid \mathcal{D}, m) = \frac{P(\mathcal{D} \mid \theta, m)\, P(\theta \mid m)}{P(\mathcal{D} \mid m)}$$
where $P(\mathcal{D} \mid \theta, m)$ is the likelihood of parameters $\theta$ in model $m$, $P(\theta \mid m)$ is the prior probability of $\theta$, and $P(\theta \mid \mathcal{D}, m)$ is the posterior of $\theta$ given data $\mathcal{D}$.
Prediction:
$$P(x \mid \mathcal{D}, m) = \int P(x \mid \theta, \mathcal{D}, m)\, P(\theta \mid \mathcal{D}, m)\, d\theta$$
Model comparison:
$$P(m \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid m)\, P(m)}{P(\mathcal{D})}$$

WHY SHOULD WE CARE?
Calibrated model and prediction uncertainty: getting systems that know when they don’t know.
Automatic model complexity control and structure learning (Bayesian Occam’s Razor).

WHAT DO I MEAN BY BEING BAYESIAN?
Let’s return to the example of neural networks / deep learning:
◮ dealing with all sources of parameter uncertainty
◮ also potentially dealing with structure uncertainty
[figure: network diagram with inputs x, weights, hidden units, weights, outputs y]
Feedforward neural nets model $p(y^{(n)} \mid x^{(n)}, \theta)$.
Parameters $\theta$ are the weights of the neural net.
Structure is the choice of architecture, number of hidden units and layers, choice of activation functions, etc.
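Before moving on, here is a toy numerical illustration of the learning, prediction, and model-comparison equations from the "one slide" above (my sketch, not from the talk), for coin-flip data with the integral over $\theta$ approximated on a grid:

```python
import numpy as np

# Data D: k heads in N tosses of a coin of unknown bias theta.
k, N = 7, 10

# Model m1: free bias theta with a uniform prior, discretised on a grid.
theta = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta) / len(theta)           # P(theta | m1)
lik = theta**k * (1 - theta)**(N - k)              # P(D | theta, m1)

# Learning: posterior = likelihood x prior, normalised by the evidence.
evidence_m1 = np.sum(lik * prior)                  # P(D | m1)
posterior = lik * prior / evidence_m1              # P(theta | D, m1)

# Prediction: P(heads | D, m1) = integral of theta * P(theta | D, m1).
p_heads = np.sum(theta * posterior)

# Model comparison against m0, a fair coin with no free parameters.
# With mild data the simpler model can win: Bayesian Occam's razor.
evidence_m0 = 0.5**N                               # P(D | m0)
posterior_m1 = evidence_m1 / (evidence_m0 + evidence_m1)  # equal model priors

print(f"P(heads | D, m1) = {p_heads:.3f}")
print(f"P(m1 | D)        = {posterior_m1:.3f}")
```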
When do we need probabilities?

WHEN IS THE PROBABILISTIC APPROACH ESSENTIAL?
Many aspects of learning and intelligence depend crucially on the careful probabilistic representation of uncertainty:
◮ forecasting
◮ decision making
◮ learning from limited, noisy, and missing data
◮ learning complex personalised models
◮ data compression
◮ automating scientific modelling, discovery, and experiment design

Automating model discovery: The automatic statistician

THE AUTOMATIC STATISTICIAN
[diagram: data and a language of models feed a search; model evaluation and checking yield predictions, and translation produces a report]
Problem: Data are now ubiquitous; there is great value in understanding this data, building models, and making predictions... however, there aren’t enough data scientists, statisticians, and machine learning experts.
Solution: Develop a system that automates model discovery from data:
◮ processing data, searching over models, discovering a good model, and explaining what has been discovered to the user.

BACKGROUND: GAUSSIAN PROCESSES
Consider the problem of nonlinear regression: you want to learn a function $f$ with error bars from data $\mathcal{D} = \{X, \mathbf{y}\}$.
[figure: data points y against x with a fitted function and error bars]
A Gaussian process defines a distribution over functions, $p(f)$, which can be used for Bayesian regression:
$$p(f \mid \mathcal{D}) = \frac{p(f)\, p(\mathcal{D} \mid f)}{p(\mathcal{D})}$$
Definition: $p(f)$ is a Gaussian process if for any finite subset $\{x_1, \ldots, x_n\} \subset \mathcal{X}$, the marginal distribution over that subset, $p(\mathbf{f})$, is multivariate Gaussian.
GPs can be used for regression, classification, ranking, dimensionality reduction, ...

Automatic Statistician for Regression and Time-Series Models

THE ATOMS OF OUR LANGUAGE OF MODELS
Five base kernels, each encoding a type of function:
Squared exp. (SE): smooth functions
Periodic (PER): periodic functions
Linear (LIN): linear functions
Constant (C): constant functions
White noise (WN): Gaussian noise

THE COMPOSITION RULES OF OUR LANGUAGE
◮ Two main operations: addition, multiplication
LIN × LIN: quadratic functions
SE × PER: locally periodic
LIN + PER: periodic plus linear trend
SE + PER: periodic plus smooth trend

MODEL SEARCH: MAUNA LOA KEELING CURVE
[figure: CO2 concentration data from the Keeling curve, with the fit refined at each step of the search over successive slides]
Search tree over kernel expressions, starting from the base kernels:
Start
SE    RQ    LIN    PER
SE + RQ    ...    PER + RQ    ...    PER × RQ
SE + PER + RQ    ...    SE × (PER + RQ)    ...
...
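As a rough sketch of how one candidate in such a search can be built and scored (my illustration using scikit-learn’s GP implementation, not the Automatic Statistician’s own machinery; the toy data and hyperparameters are invented), one can compose a kernel, fit it, and score the structure by its log marginal likelihood, the $P(\mathcal{D} \mid m)$ used for model comparison above:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF, ExpSineSquared, RationalQuadratic, WhiteKernel, ConstantKernel)

# Toy stand-in for the Keeling curve: rising trend + seasonal cycle + noise.
rng = np.random.default_rng(0)
X = np.linspace(0, 10, 200)[:, None]
y = 0.5 * X.ravel() + np.sin(2 * np.pi * X.ravel()) \
    + 0.1 * rng.standard_normal(200)

# One composite model from the search space: SE x PER + RQ + WN,
# i.e. a locally periodic component, a multi-scale smooth component,
# and observation noise.
kernel = (ConstantKernel() * RBF(length_scale=5.0)
          * ExpSineSquared(length_scale=1.0, periodicity=1.0)
          + RationalQuadratic(length_scale=5.0, alpha=1.0)
          + WhiteKernel(noise_level=0.01))

gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# The log evidence scores this kernel structure; the search compares it
# across candidate compositions and expands the best-scoring one.
print("log evidence:", gp.log_marginal_likelihood_value_)

# Predictions come with error bars, as in the GP regression slide.
X_new = np.linspace(10, 12, 20)[:, None]
mean, std = gp.predict(X_new, return_std=True)
```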