
Probabilistic machine learning: foundations and frontiers

Zoubin Ghahramani
University of Cambridge and Uber AI Labs
[email protected]
http://mlg.eng.cam.ac.uk/zoubin/
https://www.uber.com/info/ailabs/

MLSS 2017, Tübingen

An exciting time for Machine Learning!

Zoubin Ghahramani 2 / 51 APPLICATIONS OF MACHINE LEARNING

Speech and Language Technologies: automatic speech recognition, machine translation, question-answering, dialog systems
Computer Vision: Object, Face and Handwriting Recognition, Image Captioning
Scientific Data Analysis (e.g. Astronomy)
Recommender Systems
Self-driving cars
Financial Prediction and Automated Trading

Zoubin Ghahramani 3 / 51 Computer Games

DeepMind’s system learning to play ATARI games at super-human level

Zoubin Ghahramani 4 / 51 ML: MANY METHODS WITH MANY LINKS TO OTHER FIELDS

Probabilistic Models, Bayesian Methods
Neural Networks, Deep Learning
Graphical Models, Causal Inference
Kernel Methods, SVMs, Gaussian Processes
Reinforcement Learning, Control Theory, Decision Making
Symbolic Systems, Logic-based (ILP), Relational Learning
Decision Trees, Random Forests
Unsupervised Learning: Feature Discovery, Clustering, Dim. Reduction
Linear Statistical Models, Logistic Regression

Zoubin Ghahramani 5 / 51 Neural networks and deep learning

Zoubin Ghahramani 6 / 51 NEURAL NETWORKS

Neural networks are tunable nonlinear functions with many parameters.

Parameters θ are the weights of the neural net.

[Figure: feedforward network — inputs x, hidden units, outputs y, connected by weights.]

Neural nets model p(y^(n) | x^(n), θ) as a nonlinear function of θ and x, e.g.:

p(y^(n) = 1 | x^(n), θ) = σ( Σ_i θ_i x_i^(n) )

Multilayer neural networks model the overall function as a composition of functions (layers), e.g.:

y^(n) = Σ_j θ_j^(2) σ( Σ_i θ_ji^(1) x_i^(n) ) + ε^(n)

Usually trained to maximise likelihood (or penalised likelihood) using variants of stochastic gradient descent (SGD) optimisation.

NN = nonlinear function + basic stats + basic optimisation

Zoubin Ghahramani 7 / 51 DEEP LEARNING

Deep learning systems are neural network models similar to those popular in the ’80s and ’90s, with:

I some architectural and algorithmic innovations (e.g. many layers, ReLUs, dropout, LSTMs)

I vastly larger data sets (web-scale)

I vastly larger-scale compute resources (GPU, cloud)

I much better software tools (Theano, Torch, TensorFlow)

I vastly increased industry investment and media hype

figure from http://www.andreykurenkov.com/ Zoubin Ghahramani 8 / 51 LIMITATIONS OF DEEP LEARNING

Neural networks and deep learning systems give amazing performance on many benchmark tasks, but they are generally:

I very data hungry (e.g. often millions of examples)
I very compute-intensive to train and deploy (cloud GPU resources)
I poor at representing uncertainty
I easily fooled by adversarial examples
I finicky to optimise: non-convex, and the choice of architecture, learning procedure, initialisation, etc. requires expert knowledge and experimentation
I uninterpretable black-boxes, lacking in transparency, difficult to trust

Zoubin Ghahramani 9 / 51 Beyond deep learning

Zoubin Ghahramani 10 / 51 TOOLBOX VS MODELLING VIEWS OF MACHINE LEARNING

I Machine Learning is a toolbox of methods for processing data: feed the data into one of many possible methods; choose methods that have good theoretical or empirical performance; make predictions and decisions

I Machine Learning is the science of learning models from data: define a space of possible models; learn the parameters and structure of the models from data; make predictions and decisions

Zoubin Ghahramani 11 / 51 MACHINE LEARNING AS PROBABILISTIC MODELLING

I A model describes data that one could observe from a system

I If we use the mathematics of probability theory to express all forms of uncertainty and noise associated with our model...

I ...then inverse probability (i.e. Bayes rule) allows us to infer unknown quantities, adapt our models, make predictions and learn from data.

Zoubin Ghahramani 12 / 51 BAYES RULE

P(hypothesis | data) = P(hypothesis) P(data | hypothesis) / Σ_h P(h) P(data | h)

I Bayes rule tells us how to do inference about hypotheses (uncertain quantities) from data (measured quantities).
I Learning and prediction can be seen as forms of inference.

[Portrait: Reverend Thomas Bayes (1702-1761)]
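To make the rule concrete, here is a minimal plain-Julia sketch (not from the slides; the hypotheses and counts are invented): three hypothesised coin biases are updated after observing 6 heads in 10 flips.

# Bayes rule over a discrete set of hypotheses (illustrative numbers).
biases = [0.3, 0.5, 0.8]          # hypothesised coin biases
prior  = [1/3, 1/3, 1/3]          # P(hypothesis)
heads, flips = 6, 10

# P(data | hypothesis); the binomial coefficient cancels in the normalisation.
likelihood = [b^heads * (1 - b)^(flips - heads) for b in biases]

evidence  = sum(prior .* likelihood)            # sum rule: P(data) = Σ_h P(h) P(data | h)
posterior = prior .* likelihood ./ evidence     # Bayes rule
println("P(hypothesis | data) = ", round.(posterior, digits = 3))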

Zoubin Ghahramani 13 / 51 ONE SLIDE ON BAYESIAN MACHINE LEARNING

Everything follows from two simple rules:

Sum rule:     P(x) = Σ_y P(x, y)
Product rule: P(x, y) = P(x) P(y | x)

Learning:

P(θ | D, m) = P(D | θ, m) P(θ | m) / P(D | m)

where P(D | θ, m) is the likelihood of parameters θ in model m, P(θ | m) is the prior probability of θ, and P(θ | D, m) is the posterior of θ given data D.

Prediction:

P(x | D, m) = ∫ P(x | θ, D, m) P(θ | D, m) dθ

Model Comparison:

P(m | D) = P(D | m) P(m) / P(D)

Zoubin Ghahramani 14 / 51 WHY SHOULD WE CARE?

Calibrated model and prediction uncertainty: getting systems that know when they don’t know.

Automatic model complexity control and structure learning (Bayesian Occam’s Razor)
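As a hedged illustration of Bayesian Occam's razor, the plain-Julia sketch below compares a simple model (the coin is exactly fair) with a more flexible one (unknown bias, uniform prior) by their marginal likelihoods; the data and the grid integration are invented for illustration.

heads, flips = 6, 10

# Model 1 (simple): the coin is exactly fair.
evidence_fair = 0.5^flips

# Model 2 (flexible): unknown bias t with a uniform prior; integrate t out on a grid.
grid = 0.0:0.001:1.0
evidence_flex = sum(t^heads * (1 - t)^(flips - heads) for t in grid) * step(grid)

# Posterior over the two models with equal prior probability (binomial coefficients cancel).
p_fair = evidence_fair / (evidence_fair + evidence_flex)
println("P(data | fair) = ", evidence_fair, "   P(data | flexible) = ", evidence_flex)
println("P(fair | data) = ", round(p_fair, digits = 3))

With 6 heads in 10 flips the simpler model attains the higher evidence: the flexible model spreads its probability mass over many data sets that were never observed, which is exactly the complexity penalty the marginal likelihood provides.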

Zoubin Ghahramani 15 / 51 WHAT DO I MEAN BY BEING BAYESIAN?

Let’s return to the example of neural networks / deep learning:

I Dealing with all sources of parameter uncertainty
I Also potentially dealing with structure uncertainty

Feedforward neural nets model p(y^(n) | x^(n), θ).

Parameters θ are the weights of the neural net.

Structure is the choice of architecture, number of hidden units and layers, choice of activation functions, etc.

[Figure: feedforward network — inputs x, hidden units, outputs y, connected by weights.]

Zoubin Ghahramani 16 / 51 A NOTE ON MODELS VS ALGORITHMS

In early NIPS there was an "Algorithms and Architectures" track.

Models: convnets, LDA, RNNs, HMMs, Boltzmann machines, state-space models, Gaussian processes, LSTMs, ...
Algorithms: Stochastic Gradient Descent, conjugate gradients, MCMC, Variational Bayes and SVI, SGLD, belief propagation, EP, ...

There are algorithms that target finding a parameter optimum, θ*, and algorithms that target inferring the posterior p(θ | D). Often these are not so different.

Let’s be clear: “Bayesian” belongs in the Algorithms category, not the Models category. Any well-defined model can be treated in a Bayesian manner.

Zoubin Ghahramani 17 / 51 BAYESIAN DEEP LEARNING

Bayesian deep learning can be implemented in many ways:

I Laplace approximations (MacKay, 1992)
I variational approximations (Hinton and van Camp, 1993; Graves, 2011)
I MCMC (Neal, 1993)
I Stochastic gradient Langevin dynamics (SGLD; Welling and Teh, 2011)
I Probabilistic back-propagation (Hernández-Lobato et al, 2015, 2016)
I Dropout as Bayesian averaging (Gal and Ghahramani, 2015, 2016)
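As a hedged illustration of the last idea (dropout as approximate Bayesian averaging), the plain-Julia sketch below runs a small, randomly initialised network with dropout left on at test time and treats the spread of the stochastic forward passes as a rough predictive uncertainty; the architecture, weights, and numbers are invented and this is not the exact construction used in the papers.

using Random, Statistics
Random.seed!(0)

# A tiny 1-hidden-layer network with made-up, fixed weights.
W1, b1 = randn(50, 1), randn(50)
W2, b2 = randn(1, 50), randn(1)
relu(z) = max.(z, 0.0)

# One stochastic forward pass: keep each hidden unit with probability 1 - p.
function forward_dropout(x; p = 0.5)
    h    = relu(W1 * [x] .+ b1)
    mask = rand(length(h)) .> p                # sample a dropout mask
    y    = W2 * (h .* mask ./ (1 - p)) .+ b2   # rescale the kept units
    return y[1]
end

# Monte Carlo dropout: average many stochastic passes to get mean and spread.
samples = [forward_dropout(1.3) for _ in 1:200]
println("predictive mean ≈ ", mean(samples), "   std ≈ ", std(samples))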

Figure from Yarin Gal’s “Uncertainty in Deep Learning” (2016)
→ NIPS 2016 workshop on Bayesian Deep Learning

Zoubin Ghahramani 18 / 51 When do we need probabilities?

Zoubin Ghahramani 19 / 51 WHEN IS THE PROBABILISTIC APPROACH ESSENTIAL?

Many aspects of learning and intelligence depend crucially on the careful probabilistic representation of uncertainty:

I Forecasting
I Decision making
I Learning from limited, noisy, and missing data
I Learning complex personalised models
I Data compression
I Automating modelling, discovery, and experiment design

Zoubin Ghahramani 20 / 51 WHY DOES UBER CARE?

Many aspects of learning and intelligence depend crucially on the careful probabilistic representation of uncertainty:

I Forecasting, Decision making, Learning from limited, noisy, and missing data, Learning complex personalised models, Data compression, Automating modelling, discovery, and experiment design

supply (driver) and demand (rider) prediction, ETA prediction, pricing, business intelligence, modelling traffic and cities, mapping, self-driving cars, customer service, UberEATs personalization...

Zoubin Ghahramani 21 / 51

[Collage slide: research themes — The Automatic Statistician, Probabilistic Programming (Julia/Turing, Haskell/Anglican), Bayesian Deep Learning, Bayesian Optimisation, Bayesian Nonparametrics, Large-scale Inference, Computational Resource Allocation, Automating Machine Learning — illustrated with excerpts from “Automatic construction and description of nonparametric models” (J.R. Lloyd, D. Duvenaud, R. Grosse, J.B. Tenenbaum & Z. Ghahramani) and “Scalable Variational Gaussian Process Classification” (w/ Hensman & Matthews, AISTATS 2015).]

Zoubin Ghahramani 22 / 51 Nonlinear regression and Gaussian processes

Consider the problem of nonlinear regression: You want to learn a function f with error bars from data D = {X, y}

[Figure: data points with a nonlinear function fitted with error bars; axes x and y.]

A Gaussian process defines a distribution over functions p(f) which can be used for Bayesian regression:

p(f | D) = p(f) p(D | f) / p(D)

Let f = (f(x_1), f(x_2), ..., f(x_n)) be an n-dimensional vector of function values evaluated at n points x_i ∈ X. Note, f is a random variable.

Definition: p(f) is a Gaussian process if for any finite subset {x_1, ..., x_n} ⊂ X, the marginal distribution over that subset p(f) is multivariate Gaussian.
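A minimal plain-Julia sketch of GP regression with a squared-exponential kernel, on invented data and with hand-picked hyperparameters (a real system would learn these by optimising or integrating over them):

using LinearAlgebra

# Squared-exponential kernel with lengthscale ell and signal std sf.
se(a, b; ell = 1.0, sf = 1.0) = sf^2 * exp(-(a - b)^2 / (2 * ell^2))

X  = [-4.0, -2.5, 0.0, 1.0, 3.0]     # invented training inputs
y  = sin.(X)                          # invented training targets
sn = 0.1                              # observation noise std

K = [se(a, b) for a in X, b in X] + sn^2 * I
C = cholesky(Symmetric(K))

# Posterior mean and variance at a few test inputs.
for xs in range(-5, 5; length = 5)
    ks = [se(a, xs) for a in X]
    mu = dot(ks, C \ y)                       # posterior mean
    v  = se(xs, xs) - dot(ks, C \ ks)         # posterior variance
    println("f(", xs, ") ≈ ", round(mu, digits = 2), " ± ", round(sqrt(max(v, 0.0)), digits = 2))
end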

Zoubin Ghahramani 23 / 51 Automating Inference: Probabilistic Programming

Zoubin Ghahramani 24 / 51 PROBABILISTIC PROGRAMMING

Problem: Probabilistic model development and the derivation of inference algorithms is time-consuming and error-prone.

Solution:
I Develop Probabilistic Programming Languages for expressing probabilistic models as computer programs that generate data (i.e. simulators).
I Derive Universal Inference Engines for these languages that do inference over program traces given observed data (Bayes rule on computer programs).

Example languages: BUGS, Infer.NET, BLOG, STAN, Church, Venture, Anglican, Probabilistic C, Stochastic Python*, Haskell, Turing, WebPPL ...

Example inference algorithms: Metropolis-Hastings, variational inference, particle filtering, particle cascade, slice sampling*, particle MCMC, nested particle inference*, austerity MCMC*

Zoubin Ghahramani 25 / 51 The Modelling Interface - HMM Example

K = 5; N = 201
initial = fill(1.0 / K, K)                 # uniform distribution over the K initial states
means   = (collect(1.0:K) * 2 - K) * 2     # emission mean for each state

@model hmmdemo begin
  states = tzeros(Int, N)                  # latent state sequence

  # Uncomment for a Bayesian HMM (prior over the transition matrix T):
  # for i = 1:K, T[i,:] ~ Dirichlet(ones(K) ./ K); end

  states[1] ~ Categorical(initial)
  for i = 2:N
    states[i] ~ Categorical(vec(T[states[i-1], :]))   # transition
    obs[i]    ~ Normal(means[states[i]], 4)           # Gaussian emission
  end
  return states
end

[Graphical model: initial → states[1] → states[2] → states[3] → ..., with transition parameters trans; each states[i] emits data[i] with emission means statesmean.]

H. Ge, MLG, Cambridge University
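To illustrate what a universal inference engine has to do with such a program, here is a hedged plain-Julia sketch, independent of Turing's actual machinery: it repeatedly runs a small, invented HMM forward (a program trace) and weights each trace by the likelihood of the observations — simple likelihood weighting over program traces.

using Random
Random.seed!(1)

# A small, made-up HMM: 3 states, Gaussian emissions (parameters invented).
initial = [1/3, 1/3, 1/3]
trans   = [0.1 0.5 0.4; 0.2 0.2 0.6; 0.15 0.15 0.7]
means   = [-1.0, 1.0, 0.0]
obs     = [0.9, 0.8, 0.7, 0.0, -0.025, -5.0, -2.0, -0.1, 0.0, 0.13]
sn      = 0.4                                        # emission noise std

sample_cat(p) = findfirst(cumsum(p) .>= rand())      # draw from a categorical
loglik(x, m)  = -0.5 * ((x - m) / sn)^2 - log(sn * sqrt(2pi))

# Likelihood weighting: simulate the program forward, weight by p(obs | trace).
function weighted_trace(T)
    states = zeros(Int, T)
    logw   = 0.0
    states[1] = sample_cat(initial)
    logw += loglik(obs[1], means[states[1]])
    for t in 2:T
        states[t] = sample_cat(vec(trans[states[t-1], :]))
        logw += loglik(obs[t], means[states[t]])
    end
    return states, logw
end

traces = [weighted_trace(length(obs)) for _ in 1:5000]
best   = argmax([t[2] for t in traces])              # highest-weight trace (crude summary)
println("highest-weight state sequence: ", traces[best][1])

Real engines such as Turing's use far more efficient algorithms over the same traces (particle MCMC, HMC, variational inference, ...), but the underlying object — a weighted program trace — is the same.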

Probabilistic programming could revolutionise scientific modelling, machine learning, and AI.

→ NIPS 2015 tutorial by Frank Wood
→ Noah Goodman’s book
→ Turing: https://github.com/yebai/Turing.jl

Zoubin Ghahramani 26 / 51 TURING

Show video

Zoubin Ghahramani 27 / 51 Automating Optimisation: Bayesian optimisation

Zoubin Ghahramani 28 / 51 BAYESIAN OPTIMISATION

[Figure: GP posterior and acquisition function at t = 3 and t = 4; after each new observation the posterior is updated and the next point is chosen by maximising the acquisition function.]

Problem: Global optimisation of black-box functions that are expensive to evaluate

x* = arg max_x f(x)

Solution: treat as a problem of sequential decision-making and model uncertainty in the function. This has myriad applications, from robotics to drug design, to learning neural network hyperparameters.

Zoubin Ghahramani 29 / 51 BAYESIAN OPTIMISATION: IN A NUTSHELL

Exploration in black-box optimization

Black-box optimization in a nutshell:

1. initial sample
2. initialize our model
3. get the acquisition function α(x)
4. optimize it! x_next = argmax_x α(x)
5. sample new data; update model
6. repeat!
7. make recommendation

[Močkus et al., 1978; Jones et al., 1998; Jones, 2001] (slide thanks to Matthew W. Hoffman)
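A hedged plain-Julia sketch of this loop, using a squared-exponential GP surrogate and a simple upper-confidence-bound acquisition function maximised over a grid; the objective function and all settings are invented for illustration.

using LinearAlgebra, Random
Random.seed!(0)

f(x) = -(x - 0.3)^2 + 0.05 * sin(20x)       # expensive black-box (invented stand-in)
se(a, b; ell = 0.2) = exp(-(a - b)^2 / (2 * ell^2))
sn = 1e-3                                    # small noise term for numerical stability

grid = collect(0:0.01:1)                     # candidate inputs
X = [rand(), rand()]                         # steps 1-2: initial samples, initial model
y = f.(X)

for iter in 1:10
    K = [se(a, b) for a in X, b in X] + sn * I
    C = cholesky(Symmetric(K))
    alpha = map(grid) do xs                  # step 3: acquisition (GP-UCB) over the grid
        ks = [se(a, xs) for a in X]
        mu = dot(ks, C \ y)
        s2 = max(se(xs, xs) - dot(ks, C \ ks), 0.0)
        mu + 2.0 * sqrt(s2)
    end
    xnext = grid[argmax(alpha)]              # step 4: optimise the acquisition
    push!(X, xnext); push!(y, f(xnext))      # step 5: sample new data; update model
end                                          # step 6: repeat

println("recommendation: x ≈ ", X[argmax(y)])   # step 7: make recommendation

Real systems additionally learn the GP hyperparameters and use acquisition functions such as expected improvement or predictive entropy search.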

Zoubin Ghahramani 30 / 51 BAYESIAN OPTIMISATION: WHY IS IT IMPORTANT?

[Figure: GP posterior and acquisition function at t = 3 and t = 4, as above.]

A good framework for thinking about any optimisation problem. It is especially useful if:

I evaluating the function is expensive

I evaluating derivatives is hard or impossible

I there is noise in the function evaluations

I there may be (possibly noisy) constraints

I there is prior information about the function

I one needs to optimise many similar functions

Zoubin Ghahramani 31 / 51 BAYESIAN OPTIMISATION

[Figure: results from “Predictive Entropy Search with Unknown Constraints” — classification error of a 3-hidden-layer neural network constrained to make predictions in under 2 ms, tuned by constrained Bayesian optimisation (PESC vs EIC).]

(work with J.M. Hernández-Lobato, M.A. Gelbart, M.W. Hoffman, & R.P. Adams)
arXiv:1511.09422   arXiv:1511.07130   arXiv:1406.2541

Zoubin Ghahramani 32 / 51 Automating model discovery: The automatic statistician

Zoubin Ghahramani 33 / 51 THE AUTOMATIC STATISTICIAN

[Flowchart: Data → Search → Model → Prediction → Report, drawing on a Language of models, with Evaluation, Checking, and Translation steps.]

Problem: Data are now ubiquitous; there is great value from understanding this data, building models and making predictions... however, there aren’t enough data scientists, statisticians, and machine learning experts.

Solution: Develop a system that automates model discovery from data:

I processing data, searching over models, discovering a good model, and explaining what has been discovered to the user.

Zoubin Ghahramani 34 / 51 INGREDIENTS OF AN AUTOMATIC STATISTICIAN

[Flowchart: Data → Search → Model → Prediction → Report, as above.]

I An open-ended language of models
  I Expressive enough to capture real-world phenomena...
  I ...and the techniques used by human statisticians
I A search procedure
  I To efficiently explore the language of models
I A principled method of evaluating models
  I Trading off complexity and fit to data
I A procedure to automatically explain the models
  I Making the assumptions of the models explicit...
  I ...in a way that is intelligible to non-experts

Zoubin Ghahramani 35 / 51 BACKGROUND: GAUSSIAN PROCESSES

Consider the problem of nonlinear regression: You want to learn a function f with error bars from data D = {X, y}

[Figure: data points with a nonlinear function fitted with error bars; axes x and y.]

A Gaussian process defines a distribution over functions p(f) which can be used for Bayesian regression:

p(f | D) = p(f) p(D | f) / p(D)

Definition: p(f) is a Gaussian process if for any finite subset {x_1, ..., x_n} ⊂ X, the marginal distribution over that subset p(f) is multivariate Gaussian.

GPs can be used for regression, classification, ranking, dim. reduct...

→ GPflow: a Gaussian process library using TensorFlow

Zoubin Ghahramani 36 / 51 A PICTURE: GPS, LINEAR AND LOGISTIC REGRESSION, AND SVMS

[Diagram: a grid of methods along a “Bayesian” axis and a “Kernel” axis:]

                     Regression                    Classification
                     Linear Regression             Logistic Regression
Bayesian             Bayesian Linear Regression    Bayesian Logistic Regression
Kernel               Kernel Regression             Kernel Classification
Bayesian + Kernel    GP Regression                 GP Classification

→ GPflow: a Gaussian process library using TensorFlow

Zoubin Ghahramani 37 / 51 BAYESIAN NEURAL NETWORKS AND GAUSSIAN PROCESSES

Bayesian neural network

Data: D = {(x^(n), y^(n))}_{n=1}^N = (X, y)
Parameters θ are weights of the neural net.

prior        p(θ | α)
posterior    p(θ | α, D) ∝ p(y | X, θ) p(θ | α)
prediction   p(y' | D, x', α) = ∫ p(y' | x', θ) p(θ | D, α) dθ

[Figure: feedforward network — inputs x, hidden units, outputs y, connected by weights.]

A neural network with one hidden layer, infinitely many hidden units and Gaussian priors on the weights → a GP (Neal, 1994). He also analysed infinitely deep networks.
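A hedged numerical illustration of Neal's result, with invented inputs: draw many random one-hidden-layer tanh networks with Gaussian weights (output weights scaled by 1/√H) and estimate the implied prior covariance between the outputs at two fixed inputs. The covariance is stable across widths, and as the width grows the joint distribution of the outputs tends to a Gaussian, i.e. a GP.

using Statistics, Random
Random.seed!(0)

# A random 1-hidden-layer tanh network of width H with Gaussian weights,
# output weights scaled by 1/sqrt(H) so the prior over functions has a limit.
function random_net(H)
    w, b = randn(H), randn(H)
    v    = randn(H) / sqrt(H)
    x -> sum(v .* tanh.(w .* x .+ b))
end

x1, x2 = 0.5, 1.5
for H in (1, 10, 100, 1000)
    outs = [(f(x1), f(x2)) for f in (random_net(H) for _ in 1:5000)]
    c = cov(first.(outs), last.(outs))        # empirical prior covariance
    println("width ", H, ": cov(f(x1), f(x2)) ≈ ", round(c, digits = 3))
end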

Zoubin Ghahramani 38 / 51 Automatic Statistician for Regression and Time-Series Models

Zoubin Ghahramani 39 / 51 INGREDIENTS OF AN AUTOMATIC STATISTICIAN

[Flowchart: Data → Search → Model → Prediction → Report, as above.]

I An open-ended language of models
  I Expressive enough to capture real-world phenomena...
  I ...and the techniques used by human statisticians
I A search procedure
  I To efficiently explore the language of models
I A principled method of evaluating models
  I Trading off complexity and fit to data
I A procedure to automatically explain the models
  I Making the assumptions of the models explicit...
  I ...in a way that is intelligible to non-experts

Zoubin Ghahramani 40 / 51 THE ATOMS OF OUR LANGUAGE OF MODELS

Five base kernels

Squared exp. (SE)   Periodic (PER)   Linear (LIN)   Constant (C)   White noise (WN)

Encoding for the following types of functions

Smooth functions   Periodic functions   Linear functions   Constant functions   Gaussian noise

Zoubin Ghahramani 41 / 51 THE COMPOSITION RULES OF OUR LANGUAGE

I Two main operations: addition, multiplication

LIN × LIN   quadratic functions           SE × PER   locally periodic
LIN + PER   periodic plus linear trend    SE + PER   periodic plus smooth trend
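A hedged plain-Julia sketch of the two operations: the base kernels are ordinary functions, composite kernels are their pointwise sums and products, and a draw from the corresponding GP prior shows the qualitative behaviour described above (all hyperparameters are hand-picked for illustration).

using LinearAlgebra, Random
Random.seed!(0)

# Base kernels (illustrative hyperparameters).
SE(a, b)  = exp(-(a - b)^2 / 2)
PER(a, b) = exp(-2 * sin(pi * (a - b) / 3)^2)
LIN(a, b) = a * b

# Composition: sums and products of kernels are kernels.
k_locper(a, b)   = SE(a, b) * PER(a, b)      # locally periodic
k_pertrend(a, b) = LIN(a, b) + PER(a, b)     # periodic plus linear trend

# Draw one sample from the GP prior defined by kernel k on inputs x.
function sample_gp(k, x)
    K = [k(a, b) for a in x, b in x] + 1e-6 * I   # jitter for numerical stability
    cholesky(Symmetric(K)).L * randn(length(x))
end

x  = collect(0:0.25:10)
f1 = sample_gp(k_locper, x)
f2 = sample_gp(k_pertrend, x)
println("locally periodic sample, first values:       ", round.(f1[1:4], digits = 2))
println("periodic-plus-trend sample, first values:    ", round.(f2[1:4], digits = 2))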

Zoubin Ghahramani 42 / 51 MODEL SEARCH: MAUNA LOA KEELING CURVE

[Plots: posterior fit of the current best model to the Mauna Loa CO2 data (2000-2010) at successive depths of the search: RQ, then PER + RQ, then SE × (PER + RQ), then SE + SE × (PER + RQ).]

Greedy search tree:

Start
→ SE   RQ   LIN   PER
→ SE + RQ   ...   PER + RQ   ...   PER × RQ
→ SE + PER + RQ   ...   SE × (PER + RQ)   ...
→ ...
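A hedged and heavily simplified sketch of the greedy search in plain Julia: candidates are built by adding or multiplying a base kernel onto the current best expression, and each candidate is scored by the GP log marginal likelihood on the data. The real ABCD system also optimises each kernel's hyperparameters and penalises complexity (e.g. with BIC); here the hyperparameters are fixed and the data are synthetic.

using LinearAlgebra, Random
Random.seed!(1)

# Fixed-hyperparameter base kernels (a real system optimises these).
base = Dict(
    "SE"  => ((a, b) -> exp(-(a - b)^2 / 2)),
    "PER" => ((a, b) -> exp(-2 * sin(pi * (a - b))^2)),
    "LIN" => ((a, b) -> 1 + a * b),
)

# GP log marginal likelihood: -1/2 y'K⁻¹y - 1/2 log|K| - n/2 log 2π.
function logml(k, x, y; sn = 0.1)
    K = [k(a, b) for a in x, b in x] + sn^2 * I
    C = cholesky(Symmetric(K))
    -0.5 * dot(y, C \ y) - 0.5 * logdet(C) - length(y) / 2 * log(2pi)
end

# Build composite kernels by value, so each candidate captures its own parts.
add_k(k1, k2) = (a, b) -> k1(a, b) + k2(a, b)
mul_k(k1, k2) = (a, b) -> k1(a, b) * k2(a, b)

# Synthetic data: periodic signal plus trend plus noise.
x = collect(0:0.1:6)
y = sin.(2pi .* x) .+ 0.3 .* x .+ 0.05 .* randn(length(x))

function greedy_search(x, y; depths = 2)
    best_name, best_k = "SE", base["SE"]
    for depth in 1:depths
        cands = Tuple{String, Function}[]
        for (n, kb) in base
            push!(cands, ("($best_name + $n)", add_k(best_k, kb)))
            push!(cands, ("($best_name * $n)", mul_k(best_k, kb)))
        end
        scores = [logml(k, x, y) for (_, k) in cands]
        best_name, best_k = cands[argmax(scores)]
        println("depth ", depth, ": best = ", best_name, "   log ML ≈ ", round(maximum(scores), digits = 1))
    end
    return best_name
end

greedy_search(x, y)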

Zoubin Ghahramani 43 / 51 EXAMPLE: AN ENTIRELY AUTOMATIC ANALYSIS

[Plots: raw data (left) and full model posterior with extrapolations (right), 1950-1962.]

Four additive components have been identified in the data

I A linearly increasing function.

I An approximately periodic function with a period of 1.0 years and with linearly increasing amplitude.

I A smooth function.

I Uncorrelated noise with linearly increasing standard deviation.

Zoubin Ghahramani 44 / 51 EXAMPLE REPORTS

See http://www.automaticstatistician.com

Zoubin Ghahramani 45 / 51 GOOD PREDICTIVE PERFORMANCE AS WELL

[Box plot: standardised RMSE over 13 data sets for ABCD (accuracy), ABCD (interpretability), spectral kernels, trend-cyclical-irregular, Bayesian MKL, Eureqa, changepoints, squared exponential, and linear regression.]

I Tweaks can be made to the algorithm to improve accuracy or interpretability of models produced. . .

I . . . but both methods are highly competitive at extrapolation

Zoubin Ghahramani 46 / 51 Automatic Statistician for Classification

Zoubin Ghahramani 25 / 50 A Sample Report An AutoGPC Report on the Pima Diabetes Dataset2

of variable ‘plasma glucose’ and a 2-way interaction between variables ‘pedigree function’ and ‘age’ (SE + SE SE ). This model can achieve a cross-validated error rate of 22.38% and a 1 3 ◊ 4 negative log marginal likelihood of 353.28. AutoGPC Data Analysis Report on Pima Dataset We have also analysed the relevance of each input variable. They are listed below in Table 1 in descending order of inferred relevance (i.e. in ascending order of cross-validated error of the best one-dimensional GP classifier). 1.0 The Automatic Statistician Table 1: Input variables 24th May 2016

Statistics Classifier Performance ) f Dimension Variable Min Max Mean Kernel NLML Error ( 0.5

1 plasma glucose 44.00 199.00 121.88 smooth 381.22 25.00% =

This report is automatically generated by the AutoGPC. AutoGPC is an automatic Gaussian process (GP) classifier that aims to find the structure underlying a dataset without human input. For more information, please visit http://github.com/charles92/autogpc.

The training dataset spans 4 input dimensions, which are variables ‘plasma glucose’, ‘body mass index’, ‘pedigree function’ and ‘age’. There is one binary output variable y, where y = 0 (negative) represents ‘not diabetic’, and y = 1 (positive) represents ‘diabetic’.

The training dataset contains 724 data points. Among them, 249 (34.39%) have positive class labels, and the other 475 (65.61%) have negative labels. All input dimensions as well as the class label assignments are plotted in Figure 1.

Figure 1: The input dataset. Positive samples (‘diabetic’) are coloured red, and negative ones (‘not diabetic’) blue.

1 Executive Summary

The best kernel that we have found relates the class label assignment to input variables ‘plasma glucose’, ‘pedigree function’ and ‘age’. In specific, the model involves an additive combination of variable ‘plasma glucose’ and a 2-way interaction between variables ‘pedigree function’ and ‘age’ (SE1 + SE3 × SE4). This model can achieve a cross-validated error rate of 22.38% and a negative log marginal likelihood of 353.28.

We have also analysed the relevance of each input variable. They are listed below in Table 1 in descending order of inferred relevance (i.e. in ascending order of cross-validated error of the best one-dimensional GP classifier).

Table 1: Input variables
                                 Statistics               Classifier performance
Dimension  Variable            Min    Max     Mean     Kernel    NLML    Error
1          plasma glucose      44.00  199.00  121.88   smooth    381.22  25.00%
2          body mass index     18.20   67.10   32.47   smooth    347.18  33.71%
3          pedigree function    0.08    2.42    0.47   smooth    366.12  33.71%
–          Baseline              –       –       –     constant  373.79  34.39%
4          age                 21.00   81.00   33.35   smooth    344.69  34.54%

In the rest of the report, we will first describe the contribution of each individual input variable (Section 2). This is followed by a detailed analysis of the additive components that jointly make up the best model (Section 3).

2 Individual Variable Analysis

First, we try to classify the training samples using only one of the 4 input dimensions. By considering the best classification performance that is achievable in each dimension, we are able to infer which dimensions are the most relevant to the class label assignment. Note that in Table 1 the baseline performance is also included, which is achieved with a constant GP kernel. The effect of the baseline classifier is to indiscriminately classify all samples as either positive or negative, whichever is more frequent in the training data. The classifier performance in each input dimension will be compared against this baseline to tell if the input variable is discriminative in terms of class determination.

2.1 Variable ‘plasma glucose’

Variable ‘plasma glucose’ has mean value 121.88 and standard deviation 30.73. Its observed minimum and maximum are 44.00 and 199.00 respectively. A GP classifier trained on this variable alone can achieve a cross-validated error of 25.00%. Compared with the null model (baseline), this variable contains some evidence of class label assignment. There is a positive correlation between the value of this variable and the likelihood of the sample being classified as positive. The GP posterior trained on this variable is plotted in Figure 2.

Figure 2: Trained classifier on variable ‘plasma glucose’.

2.2 Variable ‘pedigree function’

Variable ‘pedigree function’ has mean value 0.47 and standard deviation 0.33. Its observed minimum and maximum are 0.08 and 2.42 respectively. A GP classifier trained on this variable alone can achieve a cross-validated error of 33.71%. This variable provides little evidence of class label assignment given a baseline error rate of 34.39%. The GP posterior trained on this variable is plotted in Figure 3.

Figure 3: Trained classifier on variable ‘pedigree function’.

2.3 Variable ‘body mass index’

Variable ‘body mass index’ has mean value 32.47 and standard deviation 6.88. Its observed minimum and maximum are 18.20 and 67.10 respectively. A GP classifier trained on this variable alone can achieve a cross-validated error of 33.71%. This variable provides little evidence of class label assignment given a baseline error rate of 34.39%. The GP posterior trained on this variable is plotted in Figure 4.

Figure 4: Trained classifier on variable ‘body mass index’.

2.4 Variable ‘age’

Variable ‘age’ has mean value 33.35 and standard deviation 11.76. Its observed minimum and maximum are 21.00 and 81.00 respectively. A GP classifier trained on this variable alone can achieve a cross-validated error of 34.54%. The classification performance (in terms of error rate) based on this variable alone is even worse than that of the naïve baseline classifier. The GP posterior trained on this variable is plotted in Figure 5.

Figure 5: Trained classifier on variable ‘age’.

3 Additive Component Analysis

The pattern underlying the dataset can be decomposed into 2 additive components, which contribute jointly to the final classifier which we have trained. With all components in action, the classifier can achieve a cross-validated error rate of 22.38%. The performance cannot be further improved by adding more components. In Table 2 we list the full additive model, all input variables, as well as more complex additive components (if any) considered above, ranked by their cross-validated error rate.

Table 2: Classification performance of the full model, its additive components (if any), all input variables, and the baseline.
Dimensions  Kernel expression  NLML    Error
1, 3, 4     SE1 + SE3 × SE4    280.45  21.83%
1           SE1                381.22  25.00%
3, 4        SE3 × SE4          422.79  30.80%
2           SE2                347.18  33.71%
3           SE3                366.12  33.71%
–           C (Baseline)       373.79  34.39%
4           SE4                344.69  34.54%

3.1 Component 1

With only one additive component, the GP classifier can achieve a cross-validated error rate of 25.00%. The corresponding negative log marginal likelihood is 381.22. This component operates on variable ‘plasma glucose’, as shown in Figure 6.

Figure 6: Trained classifier on variable ‘plasma glucose’.

3.2 Component 2

With 2 additive components, the cross-validated error can be reduced by 2.62% to 22.38%. The corresponding negative log marginal likelihood is 353.28. The additional component operates on variables ‘pedigree function’ and ‘age’, as shown in Figure 7.

Figure 7: Trained classifier on variables ‘plasma glucose’, ‘pedigree function’ and ‘age’. (a) Current additive component, involving variables ‘pedigree function’ and ‘age’. (b) Previous and current components combined, involving variables ‘plasma glucose’, ‘pedigree function’ and ‘age’.

² Obtained from [Smith et al., 1988]

Qiurui He (CUED) AutoGPC 1 June 2016 12 / 19
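The individual variable analysis in the report reduces to a simple loop: train a GP classifier on one input dimension at a time and compare its cross-validated error against a baseline that always predicts the more frequent class. The following is a minimal sketch of that idea, assuming scikit-learn and a synthetic stand-in dataset rather than the data and code used by AutoGPC; the RBF kernel stands in for the report's 'smooth' (SE) kernel.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import cross_val_score

# Stand-in for a 4-dimensional binary classification dataset.
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

# Baseline: always predict the more frequent class (plays the role of the
# constant-kernel GP baseline in the report).
baseline_error = 1.0 - cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
print(f"baseline cross-validated error: {baseline_error:.2%}")

# One-dimensional GP classifiers, one input dimension at a time.
for d in range(X.shape[1]):
    gpc = GaussianProcessClassifier(kernel=RBF(length_scale=1.0))
    error = 1.0 - cross_val_score(gpc, X[:, [d]], y, cv=5).mean()
    verdict = "discriminative" if error < baseline_error else "little or no evidence"
    print(f"dimension {d}: cross-validated error {error:.2%} ({verdict})")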

Zoubin Ghahramani 26 / 50 A Sample Report AutoGPC Report: Executive Summary

I Full model description
. . . the model involves an additive combination of variable ‘plasma glucose’ and a 2-way interaction between variables ‘pedigree function’ and ‘age’ (SE1 + SE3 × SE4). This model can achieve a cross-validated error rate of 22.38% and a negative log marginal likelihood of 353.28.

I Summary of input variables
Table 1: Input variables

                           Statistics               Classifier performance
Dim  Variable            Min    Max     Mean     Kernel    NLML    Error
1    plasma glucose      44.00  199.00  121.88   smooth    381.22  25.00%
2    body mass index     18.20   67.10   32.47   smooth    347.18  33.71%
3    pedigree function    0.08    2.42    0.47   smooth    366.12  33.71%
–    Baseline              –       –       –     constant  373.79  34.39%
4    age                 21.00   81.00   33.35   smooth    344.69  34.54%

Qiurui He (CUED) AutoGPC 1 June 2016 13 / 19
Zoubin Ghahramani 27 / 50

A Sample Report AutoGPC Report: Individual Variable Analysis

I Variable ‘plasma glucose’: moderate evidence, monotonic
. . . Compared with the null model (baseline), this variable contains some evidence of class label assignment. There is a positive correlation between the value of this variable and the likelihood of the sample being classified as positive.

I Variable ‘body mass index’: little evidence
. . . This variable provides little evidence of class label assignment given a baseline error rate of 34.39%.

I Variable ‘age’: no evidence
. . . The classification performance (in terms of error rate) based on this variable alone is even worse than that of the naïve baseline classifier.

I Other variants: strong evidence, etc.

Qiurui He (CUED) AutoGPC 1 June 2016 14 / 19

Zoubin Ghahramani 28 / 50 A Sample Report AutoGPC Report: Additive Component Analysis

I For example, for the second additive component
[With only one additive component, . . . ] With 2 additive components, the cross-validated error can be reduced by 2.62% to 22.38%. The corresponding negative log marginal likelihood is 353.28. The additional component operates on variables ‘pedigree function’ and ‘age’, as shown in Figure . . .
I Summary of all models
Table 2: Classification performance of the full model, its additive components (if any), all input variables, and the baseline.

Dim      Kernel expression  NLML    Error
1, 3, 4  SE1 + SE3 × SE4    280.45  21.83%
1        SE1                381.22  25.00%
3, 4     SE3 × SE4          422.79  30.80%
2        SE2                347.18  33.71%
3        SE3                366.12  33.71%
–        C (Baseline)       373.79  34.39%
4        SE4                344.69  34.54%
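The kernel expressions in this table are built by adding and multiplying one-dimensional squared-exponential (SE) kernels, each restricted to a single input dimension. As a rough illustration only (not the AutoGPC implementation), the best model's kernel SE1 + SE3 × SE4 could be assembled with the GPy library roughly as follows; the toy data and hyperparameters are placeholders.

import numpy as np
import GPy

# Toy stand-in for a 4-dimensional dataset with 0/1 labels.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
Y = (X[:, 0] + X[:, 2] * X[:, 3] > 1.0).astype(float)[:, None]

# SE (RBF) kernel on dimension 1, plus an SE3 x SE4 interaction term
# (0-based column indices 0, 2 and 3).
k1 = GPy.kern.RBF(input_dim=1, active_dims=[0])
k34 = GPy.kern.RBF(input_dim=1, active_dims=[2]) * GPy.kern.RBF(input_dim=1, active_dims=[3])
kernel = k1 + k34  # SE1 + SE3 x SE4

# GP classification with this composite kernel; optimise the (approximate)
# marginal likelihood, analogous to the NLML values reported in the table.
model = GPy.models.GPClassification(X, Y, kernel=kernel)
model.optimize()
print(model)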

Qiurui He (CUED) AutoGPC 1 June 2016 15 / 19
Zoubin Ghahramani 29 / 50

MODEL CHECKING AND CRITICISM

I Good statistical modelling should include model criticism:

I Does the data match the assumptions of the model?

I Our automatic statistician does posterior predictive checks, dependence tests and residual tests
I We have also been developing more systematic nonparametric approaches to model criticism using kernel two-sample testing:
Lloyd, J. R., and Ghahramani, Z. (2015) Statistical Model Criticism using Kernel Two Sample Tests. NIPS 2015.
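As a concrete illustration of the kernel two-sample idea (a minimal sketch, not the code from Lloyd and Ghahramani, 2015): compute an unbiased RBF-kernel MMD statistic between observed data and replicate data drawn from a fitted model, then calibrate it with a permutation test. The toy "model" below is deliberately misspecified so the test has something to detect.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # k(a, b) = exp(-||a - b||^2 / (2 * lengthscale^2))
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

def mmd2_unbiased(X, Y, lengthscale=1.0):
    # Unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q.
    Kxx = rbf_kernel(X, X, lengthscale)
    Kyy = rbf_kernel(Y, Y, lengthscale)
    Kxy = rbf_kernel(X, Y, lengthscale)
    n, m = len(X), len(Y)
    np.fill_diagonal(Kxx, 0.0)
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=(200, 1))        # observed data
replicates = rng.normal(0.3, 1.0, size=(200, 1))  # draws from a misspecified model
stat = mmd2_unbiased(data, replicates)

# Permutation test: shuffle the pooled sample to approximate the null
# distribution of the statistic under "model samples match the data".
pooled = np.vstack([data, replicates])
null_stats = []
for _ in range(200):
    perm = rng.permutation(len(pooled))
    null_stats.append(mmd2_unbiased(pooled[perm[:200]], pooled[perm[200:]]))
p_value = np.mean(np.array(null_stats) >= stat)
print(f"MMD^2 = {stat:.4f}, permutation p-value = {p_value:.3f}")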

Zoubin Ghahramani 47 / 51 THE AUTOML COMPETITION

I New algorithms for building machine learning systems that learn under strict time, CPU, memory and disk space constraints, making decisions about where to allocate computational resources so as to maximise statistical performance.

I Second and First place in the first two rounds of the AutoML classification challenge to “design machine learning methods capable of performing all model selection and parameter tuning

without any human intervention.”
Zoubin Ghahramani 48 / 51 CONCLUSIONS

Probabilistic modelling offers a framework for building systems that reason about uncertainty and learn from data, going beyond traditional pattern recognition problems.

I have briefly reviewed some of the frontiers of our research, centred around the theme of automating machine learning, including:
I The automatic statistician
I Probabilistic programming
I Bayesian optimisation

Ghahramani, Z. (2015) Probabilistic machine learning and artificial intelligence. Nature 521:452–459. http://www.nature.com/nature/journal/v521/n7553/full/nature14541.html
Uber AI Labs is hiring!

Zoubin Ghahramani 49 / 51 COLLABORATORS

Ryan P. Adams, James R. Lloyd, Yutian Chen, David J. C. MacKay, David Duvenaud, Adam Ścibior, Yarin Gal, Amar Shah, Hong Ge, Emma Smith, Michael A. Gelbart, Christian Steinruecken, Roger Grosse, Joshua B. Tenenbaum, José Miguel Hernández-Lobato, Andrew G. Wilson, Matthew W. Hoffman, Kai Xu

Zoubin Ghahramani 50 / 51 PAPERS

General:
Ghahramani, Z. (2013) Bayesian nonparametrics and the probabilistic approach to modelling. Philosophical Trans. Royal Society A 371: 20110553.
Ghahramani, Z. (2015) Probabilistic machine learning and artificial intelligence. Nature 521:452–459. http://www.nature.com/nature/journal/v521/n7553/full/nature14541.html

Automatic Statistician:
Website: http://www.automaticstatistician.com
Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2013) Structure Discovery in Nonparametric Regression through Compositional Kernel Search. ICML 2013.
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B. and Ghahramani, Z. (2014) Automatic Construction and Natural-language Description of Nonparametric Regression Models. AAAI 2014. http://arxiv.org/pdf/1402.4304v2.pdf
Lloyd, J. R., and Ghahramani, Z. (2015) Statistical Model Criticism using Kernel Two Sample Tests. NIPS 2015. http://mlg.eng.cam.ac.uk/Lloyd/papers/kernel-model-checking.pdf

Bayesian Optimisation:
Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. (2014) Predictive Entropy Search for Efficient Global Optimization of Black-box Functions. NIPS 2014.
Hernández-Lobato, J. M., Gelbart, M. A., Adams, R. P., Hoffman, M. W., and Ghahramani, Z. (2016) A General Framework for Constrained Bayesian Optimization using Information-based Search. Journal of Machine Learning Research 17(160):1–53.

Zoubin Ghahramani 51 / 51 PAPERS II

Probabilistic Programming:
Turing: https://github.com/yebai/Turing.jl
Chen, Y., Mansinghka, V., and Ghahramani, Z. (2014) Sublinear-Time Approximate MCMC Transitions for Probabilistic Programs. arXiv:1411.1690.
Ge, H., Ścibior, A., and Ghahramani, Z. (2016) Turing: Rejuvenating Probabilistic Programming in Julia. (In preparation.)

Bayesian neural networks:
Hernández-Lobato, J. M., and Adams, R. P. (2015) Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. ICML 2015.
Gal, Y., and Ghahramani, Z. (2016) Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. ICML 2016.
Gal, Y., and Ghahramani, Z. (2016) A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. NIPS 2016.
Hernández-Lobato, J. M., Li, Y., Hernández-Lobato, D., Bui, T., and Turner, R. E. (2016) Black-Box Alpha Divergence Minimization. ICML 2016.

Zoubin Ghahramani 52 / 51