Bayesian Learning
SC4/SM8 Advanced Topics in Statistical Machine Learning
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/

Review of Bayesian Inference: The Bayesian Learning Framework

- Bayesian learning treats the parameter vector θ as a random variable: the process of learning is then the computation of the posterior distribution p(θ | D).
- In addition to the likelihood p(D | θ), we need to specify a prior distribution p(θ). The posterior distribution is then given by Bayes' theorem:

      p(θ | D) = p(D | θ) p(θ) / p(D)

  Likelihood: p(D | θ). Prior: p(θ). Posterior: p(θ | D). Marginal likelihood: p(D) = ∫_Θ p(D | θ) p(θ) dθ.
- Summarising the posterior:
  - Posterior mode: θ̂^MAP = argmax_{θ ∈ Θ} p(θ | D) (maximum a posteriori).
  - Posterior mean: θ̂^mean = E[θ | D].
  - Posterior variance: Var[θ | D].

Review of Bayesian Inference: Bayesian Inference on the Categorical Distribution

- Suppose we observe data y_1, ..., y_n with y_i ∈ {1, ..., K}, and model them as i.i.d. with pmf π = (π_1, ..., π_K):

      p(D | π) = ∏_{i=1}^n π_{y_i} = ∏_{k=1}^K π_k^{n_k},

  with n_k = Σ_{i=1}^n 1(y_i = k), π_k > 0 and Σ_{k=1}^K π_k = 1.
- The conjugate prior on π is the Dirichlet distribution Dir(α_1, ..., α_K) with parameters α_k > 0 and density

      p(π) = [Γ(Σ_{k=1}^K α_k) / ∏_{k=1}^K Γ(α_k)] ∏_{k=1}^K π_k^{α_k − 1}

  on the probability simplex {π : π_k > 0, Σ_{k=1}^K π_k = 1}.
- The posterior is also Dirichlet, Dir(α_1 + n_1, ..., α_K + n_K), with posterior mean

      π̂_k^mean = (α_k + n_k) / Σ_{j=1}^K (α_j + n_j).

Review of Bayesian Inference: Dirichlet Distributions

[Figure: (A) the support of the Dirichlet density for K = 3, i.e. the probability simplex; (B) the Dirichlet density for α_k = 10; (C) the Dirichlet density for α_k = 0.1.]

Bayesian Naïve Bayes Classifier: Naïve Bayes

- Consider the classification example with the naïve Bayes classifier, where each document is represented by a binary word-occurrence vector x_i ∈ {0, 1}^p:

      p(x_i | φ_k) = ∏_{j=1}^p φ_kj^{x_i^(j)} (1 − φ_kj)^{1 − x_i^(j)}.

- Set n_k = Σ_{i=1}^n 1{y_i = k} and n_kj = Σ_{i=1}^n 1{y_i = k, x_i^(j) = 1}. The MLEs are

      π̂_k = n_k / n,    φ̂_kj = (Σ_{i: y_i = k} x_i^(j)) / n_k = n_kj / n_k.

- A problem: if the ℓ-th word did not appear in any document labelled as class k, then φ̂_kℓ = 0 and

      P(Y = k | X = x with ℓ-th entry equal to 1) ∝ π̂_k ∏_{j=1}^p φ̂_kj^{x^(j)} (1 − φ̂_kj)^{1 − x^(j)} = 0,

  i.e. we will never attribute a new document containing word ℓ to class k, regardless of the other words in it.

Bayesian Naïve Bayes Classifier: Bayesian Inference on the Naïve Bayes Model

- Under the naïve Bayes model, the joint distribution of labels y_i ∈ {1, ..., K} and data vectors x_i ∈ {0, 1}^p is

      p(D | θ) = ∏_{i=1}^n p(x_i, y_i | θ)
               = ∏_{i=1}^n ∏_{k=1}^K ( π_k ∏_{j=1}^p φ_kj^{x_i^(j)} (1 − φ_kj)^{1 − x_i^(j)} )^{1(y_i = k)}
               = ∏_{k=1}^K π_k^{n_k} ∏_{j=1}^p φ_kj^{n_kj} (1 − φ_kj)^{n_k − n_kj},

  where n_k = Σ_{i=1}^n 1(y_i = k) and n_kj = Σ_{i=1}^n 1(y_i = k, x_i^(j) = 1).
- For a conjugate prior, we can use Dir((α_k)_{k=1}^K) for π, and Beta(a, b) for each φ_kj independently.
- Because the likelihood factorises, the posterior distribution over π and (φ_kj) also factorises: the posterior for π is Dir((α_k + n_k)_{k=1}^K), and the posterior for φ_kj is Beta(a + n_kj, b + n_k − n_kj) (see the sketch below).
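These conjugate updates just add pseudocounts to the observed counts. Below is a minimal numerical sketch of the update (not from the slides; the toy data, hyperparameter values and variable names are illustrative assumptions), computing the Dir(α_k + n_k) posterior for π and the Beta(a + n_kj, b + n_k − n_kj) posteriors for φ_kj from a binary document-term matrix:

```python
import numpy as np

# Toy binary document-term data: n = 4 documents, p = 3 words, K = 2 classes (illustrative only).
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])          # x_i in {0,1}^p
y = np.array([0, 0, 1, 1])         # y_i in {0, ..., K-1}
K = 2
alpha, a, b = 1.0, 1.0, 1.0        # Dir(alpha, ..., alpha) prior on pi, Beta(a, b) prior on each phi_kj

# Sufficient statistics n_k and n_kj.
n_k = np.array([(y == k).sum() for k in range(K)])
n_kj = np.array([X[y == k].sum(axis=0) for k in range(K)])

# Conjugate posteriors: Dir(alpha + n_k) for pi, Beta(a + n_kj, b + n_k - n_kj) for phi_kj.
dir_post = alpha + n_k
beta_post_a = a + n_kj
beta_post_b = b + (n_k[:, None] - n_kj)

# Posterior means: the pseudocount-smoothed estimates used for prediction.
pi_mean = dir_post / dir_post.sum()
phi_mean = beta_post_a / (beta_post_a + beta_post_b)
print(pi_mean)    # [0.5 0.5]
print(phi_mean)   # no entry is exactly 0 or 1, unlike the MLE
```

Unlike the maximum likelihood estimate φ̂_kj = n_kj / n_k, none of these posterior means is exactly zero, which is what resolves the zero-count problem described above.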
- Given data D = {(x_i, y_i)}_{i=1}^n, we want to predict a label ỹ for a new document x̃.
- We can calculate p(x̃, ỹ = k | D) = p(ỹ = k | D) p(x̃ | ỹ = k, D), with

      p(ỹ = k | D) = (α_k + n_k) / (Σ_{l=1}^K α_l + n),    p(x̃^(j) = 1 | ỹ = k, D) = (a + n_kj) / (a + b + n_k).

- The predicted class probabilities are

      p(ỹ = k | x̃, D) = p(ỹ = k | D) p(x̃ | ỹ = k, D) / p(x̃ | D)
                       ∝ [(α_k + n_k) / (Σ_{l=1}^K α_l + n)] ∏_{j=1}^p [(a + n_kj) / (a + b + n_k)]^{x̃^(j)} [(b + n_k − n_kj) / (a + b + n_k)]^{1 − x̃^(j)}.

- Compared to the ML plug-in estimator, the pseudocounts help to "regularise" the probabilities away from extreme values.

Bayesian Model Selection: Bayesian Learning and Regularisation

- Consider a Bayesian approach to logistic regression: introduce a multivariate normal prior for the weight vector w ∈ R^p and a uniform (improper) prior for the offset b ∈ R. The prior density is

      p(b, w) = 1 · (2πσ²)^{−p/2} exp(−‖w‖₂² / (2σ²)).

- With labels y_i ∈ {−1, +1}, the posterior is

      p(b, w | D) ∝ exp( −‖w‖₂² / (2σ²) − Σ_{i=1}^n log(1 + exp(−y_i (b + wᵀ x_i))) ).

  The posterior mode is therefore equivalent to minimising the L2-regularised empirical risk.
- Regularised empirical risk minimisation is (often) equivalent to placing a prior on the parameters and finding a MAP estimate:
  - L2 regularisation corresponds to a multivariate normal prior;
  - L1 regularisation corresponds to a multivariate Laplace prior.
- From a Bayesian perspective, the MAP parameters are just one way to summarise the posterior distribution.

Bayesian Model Selection

- A model M with a given set of parameters θ_M consists of both the likelihood p(D | θ_M) and the prior distribution p(θ_M).
- The posterior distribution is

      p(θ_M | D, M) = p(D | θ_M, M) p(θ_M | M) / p(D | M).

- The marginal probability of the data under M (the Bayesian model evidence) is

      p(D | M) = ∫_Θ p(D | θ_M, M) p(θ_M | M) dθ_M.

- Models are compared using their Bayes factors p(D | M) / p(D | M′).

Bayesian Model Selection: Bayesian Occam's Razor

- Occam's razor: of two explanations adequate to explain the same set of observations, the simpler should be preferred.
- The model evidence p(D | M) = ∫_Θ p(D | θ_M, M) p(θ_M | M) dθ_M is the probability that a set of parameter values drawn at random from the prior of model M would generate the dataset D.
- Models that are too simple are unlikely to generate the observed dataset.
- Models that are too complex can generate many possible datasets, so, again, they are unlikely to generate that particular dataset at random.

Bayesian Model Selection: Bayesian model comparison – Occam's razor at work

[Figure: polynomial fits of orders M = 0, ..., 7 to the same dataset, shown alongside the model evidence P(Y | M) for each order. Figures by M. Sahani.]

Bayesian Model Selection: Bayesian Computation

- Most posteriors are intractable, and posterior approximations need to be used:
  - Laplace approximation (a minimal sketch for the Bayesian logistic regression posterior follows below);
  - variational methods (variational Bayes, expectation propagation);
  - Monte Carlo methods (MCMC and SMC);
  - Approximate Bayesian Computation (ABC).
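As a minimal illustration of the first item in this list, the sketch below (not from the slides; the simulated data, the prior scale sigma2 and the use of scipy's BFGS optimiser are illustrative assumptions) forms a Laplace approximation to the Bayesian logistic regression posterior from the regularisation slide: it finds the MAP estimate of (b, w) by minimising the negative log-posterior, then approximates the posterior by a Gaussian centred at the MAP with covariance given by the inverse Hessian of the negative log-posterior.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)

# Simulated data: labels y_i in {-1, +1}, features x_i in R^p (illustrative only).
n, p, sigma2 = 100, 2, 10.0
X = rng.normal(size=(n, p))
y = np.sign(0.3 + X @ np.array([1.5, -2.0]) + 0.5 * rng.normal(size=n))

def neg_log_posterior(theta):
    # theta = (b, w); flat prior on b, N(0, sigma2 * I) prior on w.
    b, w = theta[0], theta[1:]
    margins = y * (b + X @ w)
    return np.sum(np.logaddexp(0.0, -margins)) + w @ w / (2 * sigma2)

# MAP estimate: the same optimisation as L2-regularised logistic regression (offset unpenalised).
theta_map = minimize(neg_log_posterior, np.zeros(p + 1), method="BFGS").x

# Hessian of the negative log-posterior at the MAP, in closed form.
Z = np.hstack([np.ones((n, 1)), X])      # design matrix with an intercept column
s = expit((Z @ theta_map) * y)           # sigmoid of the signed margins
H = (Z * (s * (1 - s))[:, None]).T @ Z   # sum_i s_i (1 - s_i) z_i z_i^T
H[1:, 1:] += np.eye(p) / sigma2          # add the prior precision (none on b)

# Laplace approximation: p(b, w | D) is approximated by N(theta_map, H^{-1}).
Sigma = np.linalg.inv(H)
print("MAP estimate:", theta_map)
print("posterior standard deviations:", np.sqrt(np.diag(Sigma)))
```

The Gaussian N(theta_map, H⁻¹) can then stand in for the exact posterior, e.g. to report approximate credible intervals for the weights; how well it works depends on how close the true posterior is to Gaussian.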
Discussion and Further Reading: Bayesian Learning – Discussion

- Use probability distributions to reason about the uncertainty in parameters (latent variables and parameters are treated in the same way).
- A model consists of the likelihood function and the prior distribution on the parameters: this allows prior beliefs and domain knowledge to be incorporated into learning.
- The prior usually has hyperparameters ψ, i.e. p(θ) = p(θ | ψ). How should ψ be chosen? The marginal likelihood of the hyperparameters is

      p(D | ψ) = ∫ p(D | θ) p(θ | ψ) dθ.

  - Be Bayesian about ψ as well: choose a hyperprior p(ψ), compute

        p(ψ | D) = p(D | ψ) p(ψ) / p(D),

    and integrate the predictive posterior over the hyperparameters.
  - Maximum Likelihood II: ψ̂ = argmax_{ψ ∈ Ψ} p(D | ψ) (a numerical sketch follows at the end of this section).

Discussion and Further Reading: Bayesian Learning – Further Reading

- Videolectures by Zoubin Ghahramani: Bayesian Learning.
- Murphy, Chapter 5.
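As a minimal sketch of the Maximum Likelihood II idea above (not from the slides; the linear-Gaussian model, the simulated data and the grid search are illustrative assumptions), consider Bayesian linear regression with weights w ~ N(0, ψ I) and Gaussian noise. The weights can be integrated out analytically, giving the marginal likelihood y ~ N(0, ψ X Xᵀ + σ² I), so the hyperparameter ψ can be chosen by maximising this evidence directly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression data: y = X w + noise (illustrative only).
n, p, noise_var = 60, 5, 0.25
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + np.sqrt(noise_var) * rng.normal(size=n)

def log_evidence(prior_var):
    # log p(y | psi) with w ~ N(0, psi * I) integrated out: y ~ N(0, psi * X X^T + noise_var * I).
    C = prior_var * (X @ X.T) + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

# Maximum Likelihood II: choose the prior variance by maximising the marginal likelihood on a grid.
grid = np.logspace(-3, 2, 100)
psi_hat = grid[np.argmax([log_evidence(v) for v in grid])]
print("ML-II estimate of the prior variance:", psi_hat)
```

This is the marginal likelihood p(D | ψ) from the discussion slide evaluated in closed form for a conjugate model; the same evidence-maximisation recipe selects the ridge penalty (σ²/ψ in the corresponding MAP problem) in the Bayesian view of regularised regression.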