A Brief Overview of Probability Theory in Data Science by Geert
Total Page:16
File Type:pdf, Size:1020Kb
A brief overview of probability theory in data science Geert Verdoolaege 1Department of Applied Physics, Ghent University, Ghent, Belgium 2Laboratory for Plasma Physics, Royal Military Academy (LPP–ERM/KMS), Brussels, Belgium Tutorial 3rd IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 27-05-2019 Overview 1 Origins of probability 2 Frequentist methods and statistics 3 Principles of Bayesian probability theory 4 Monte Carlo computational methods 5 Applications Classification Regression analysis 6 Conclusions and references 2 Overview 1 Origins of probability 2 Frequentist methods and statistics 3 Principles of Bayesian probability theory 4 Monte Carlo computational methods 5 Applications Classification Regression analysis 6 Conclusions and references 3 Early history of probability Earliest traces in Western civilization: Jewish writings, Aristotle Notion of probability in law, based on evidence Usage in finance Usage and demonstration in gambling 4 Middle Ages World is knowable but uncertainty due to human ignorance William of Ockham: Ockham’s razor Probabilis: a supposedly ‘provable’ opinion Counting of authorities Later: degree of truth, a scale Quantification: Law, faith ! Bayesian notion Gaming ! frequentist notion 5 Quantification 17th century: Pascal, Fermat, Huygens Comparative testing of hypotheses Population statistics 1713: Ars Conjectandi by Jacob Bernoulli: Weak law of large numbers Principle of indifference De Moivre (1718): The Doctrine of Chances 6 Bayes and Laplace Paper by Thomas Bayes (1763): inversion of binomial distribution Pierre Simon Laplace: Practical applications in physical sciences 1812: Théorie Analytique des Probabilités 7 Frequentism and sampling theory George Boole (1854), John Venn (1866) Sampling from a ‘population’ Notion of ‘ensembles’ Andrey Kolmogorov (1930s): axioms of probability Early 20th century classical statistics: Ronald Fisher, Karl Pearson, Jerzy Neyman, Egon Pearson Applications in social sciences, medicine, natural sciences 8 Maximum entropy and Bayesian methods Statistical mechanics with Ludwig Boltzmann (1868): maximum entropy energy distribution Josiah Willard Gibbs (1874–1878) Claude Shannon (1948): maximum entropy in frequentist language Edwin Thompson Jaynes: Bayesian re-interpretation Harold Jeffreys (1939): standard work on (Bayesian) probability Richard T. Cox (1946): derivation of probability laws Computers: flowering of Bayesian methods 9 Overview 1 Origins of probability 2 Frequentist methods and statistics 3 Principles of Bayesian probability theory 4 Monte Carlo computational methods 5 Applications Classification Regression analysis 6 Conclusions and references 10 Frequentist probability Probability = frequency Straightforward: Number of 1s in 60 dice throws ≈ 10: p = 1/6 Probability of plasma disruption p ≈ Ndisr/Ntot Less straightforward: Probability of fusion electricity by 2050? Probability of mass of Saturn 90 mA ≤ mS < 100 mA? 11 Populations vs. sample 12 Statistics E.g. weight w of Belgian men: unknown but fixed for every individual Average weight mw in population? Random variable W Sample: W1, W2,..., Wn Average weight: statistic (estimator) W Central limit theorem: p W ∼ p(Wjmw, sw) ) W ∼ N (mw, sw/ n) 13 Maximum likelihood parameter estimation Maximum likelihood (ML) principle: mˆw = arg max p(W1,..., Wnjmw, sw) + mw2R n 1 (W − m )2 ≈ p − i w arg max ∏ exp 2 + 2s mw2R i=1 2psw w " # 1 n (W − m )2 = p − i w arg max exp ∑ 2 + 2s mw2R 2psw i=1 w ML estimator (known sw): 1 n mˆw = W = ∑ Wi n i=1 14 Frequentist hypothesis testing Weight of Dutch men compared to Belgian men (populations) Observed sample averages WNL, WBE Null hypothesis H0: mw,NL = mw,BE Test statistic: W − W NL BE ∼ N (0, 1) (under H ) s 0 WCZ−WBE 15 Overview 1 Origins of probability 2 Frequentist methods and statistics 3 Principles of Bayesian probability theory 4 Monte Carlo computational methods 5 Applications Classification Regression analysis 6 Conclusions and references 16 Probability theory: quantifying uncertainty Statistical (aleatoric) vs. epistemic uncertainty? Every piece of information has uncertainty Uncertainty = lack of information Observation may reduce uncertainty Probability (distribution) quantifies uncertainty 17 Example: physical sciences Measurement of physical quantity Origin of stochasticity: Apparatus Microscopic fluctuations Systematic uncertainty is assigned a probability distribution E.g. coin tossing, voltage measurement, probability of hypothesis vs. another, . Bayesian: no random variables 18 What is probability? Objective Bayesian view Probability = real number 2 [0, 1] Always conditioned on known information Notation: p(AjB) or p(AjI) Extension of logic: measure of degree to which B implies A Degree of plausibility, but subject to consistency Same information ) same probabilities Probability distribution: outcome ! probability 19 Joint, marginal and conditional distributions p(x, y), p(x), p(y), p(xjy), p(yjx) 20 Example: normal distribution Normal/Gaussian probability density function (PDF): 1 (x − m)2 p(xjm, s, I) = p exp − 2ps 2s2 Probability x1 ≤ x < x1 + dx Inverse problem: m, s given x? 21 Updating information states Bayes’ theorem x = data vector p(xjq, I)p(qjI) p(qjx, I)= q = vector of model parameters p(xjI) I = implicit knowledge Likelihood: misfit between model and data Prior distribution: ‘expert’ or diffuse knowledge Evidence: Z Z p(xjI) = p(x, qjI) dq = p(xjq, I)p(qjI) dq Posterior distribution 22 Updating information states Bayes’ theorem x = data vector p(xjq, I)p(qjI) p(qjx, I)= q = vector of model parameters p(xjI) I = implicit knowledge Likelihood: misfit between model and data Prior distribution: ‘expert’ or diffuse knowledge Evidence: Z Z p(xjI) = p(x, qjI) dq = p(xjq, I)p(qjI) dq Posterior distribution 22 Updating information states Bayes’ theorem x = data vector p(xjq, I)p(qjI) p(qjx, I)= q = vector of model parameters p(xjI) I = implicit knowledge Likelihood: misfit between model and data Prior distribution: ‘expert’ or diffuse knowledge Evidence: Z Z p(xjI) = p(x, qjI) dq = p(xjq, I)p(qjI) dq Posterior distribution 22 Updating information states Bayes’ theorem x = data vector p(xjq, I)p(qjI) p(qjx, I)= q = vector of model parameters p(xjI) I = implicit knowledge Likelihood: misfit between model and data Prior distribution: ‘expert’ or diffuse knowledge Evidence: Z Z p(xjI) = p(x, qjI) dq = p(xjq, I)p(qjI) dq Posterior distribution 22 Practical considerations Updating state of knowledge = learning ‘Uninformative’ prior (distribution) Assigning uninformative priors: Transformation invariance (Jaynes ’68): e.g. uniform for location variable Show me the prior! Principle of indifference Testable information: maximum entropy ... 23 Mean of a normal distribution: uniform prior n measurements xi ! x Independent and identically distributed xi: n 1 (x − m)2 ( j ) = p − i p x m, s, I ∏ exp 2 i=1 2ps 2s Bayes’ rule: p(m, sjx, I) ∝ p(xjm, s, I)p(m, sjI) Suppose s ≡ se ! delta function Assume m 2 [mmin, mmax] ! uniform prior: 8 1 < , if m 2 [mmin, mmax] p(mjI) = mmax − mmin :0, otherwise Let mmin ! −¥, mmax ! +¥ ) improper prior Ensure proper posterior 24 Posterior for m Posterior: n (x − m)2 ( j ) − ∑i=1 i p m x, I ∝ exp 2 2se Define n n 1 2 1 2 x ≡ ∑ xi, (Dx) ≡ ∑(xi − x¯) n i=1 n i=1 Adding and subtracting 2nx2 (‘completing the square’), 1 h i ( j ) − ( − )2 + ( )2 p m x, I ∝ exp 2 m x Dx 2se /n Retaining dependence on m, (m − x)2 ( j ) − p m x, I ∝ exp 2 2se /n 2 m ∼ N (x, se /n) 25 Mean of a normal distribution: normal prior 2 Normal prior: m ∼ N (m0, t ) Posterior: n (x − m)2 (m − m )2 ( j ) − ∑i=1 i × − 0 p m x, I ∝ exp 2 exp 2 2se 2t Expanding and completing the square, 2 m ∼ N (mn, sn ), where − n 1 n 1 1 ≡ 2 + ≡ 2 + mn sn 2 x 2 m0 and mn sn 2 2 se t se t mn is weighted average of m0 and x 26 Care for the prior Intrinsic part of Bayesian theory, enabling coherent learning High-quality data may be scarce Regularization of ill-posed problems (e.g. tomography) Prior can retain influence: e.g. regression with errors in all variables 27 Unknown mean and standard deviation Repeated measurements ! information on s Scale variable s ! Jeffreys’ scale prior: 1 p(sjI) ∝ , s 2 ]0, +¥[ s Posterior: " # 1 (m − x)2 + (Dx)2 1 p(m, sjx, I) ∝ exp − × sn 2s2/n s 28 Marginal posterior for m (1) Marginalization = integrating out a (nuisance) parameter: Z +¥ p(mjx, I) = p(m, sjx, I) ds 0 " #− n Z +¥ 2 2 2 1 (m − x) + (Dx) n −1 −s ∝ s 2 e ds 0 2 2/n " #− n 1 n (m − x)2 + (Dx)2 2 = G , 2 2 2/n where (m − x)2 + (Dx)2 s ≡ 2s2/n 29 Marginal posterior for m (2) After normalization: − n n " 2 # 2 G 2 (m − x) p(mjx, I) = q 1 + 2 n−1 (Dx)2 p(Dx) G 2 Changing variables, (m − x)2 t ≡ q , with p(tjx, I)dt ≡ p(mjx, I)dm, (Dx)2/(n − 1) − n G n t2 2 p(tjx, I) = 2 1 + p n−1 n − 1 (n − 1)pG 2 Student’s t-distribution with parameter n = n − 1 If n 1, " # 1 (m − x)2 p(mjx, I) −! q exp − ( )2 2(Dx)2/n 2p Dx /n 30 Marginal posterior for s Marginalization of m: Z +¥ p(sjx, I) = p(m, sjx, I) dm −¥ " # 1 (Dx)2 ∝ exp − sn 2s2/n Setting X ≡ n(Dx)2/s2, 1 k −1 − X ( jx ) = 2 2 ≡ − p X , I k X e , k n 1 2 k 2 G 2 c2 distribution with parameter k 31 The Laplace approximation (1) Laplace (saddle point) approximation of distributions around the mode (= maximum) E.g. marginal for m: − n n " 2 # 2 G 2 (m − x) p(mjx, I) = q 1 + 2 n−1 (Dx)2 p(Dx) G 2 Taylor expansion around mode: 1 d2(ln p) [ ( j )] ≈ [ ( j )] + ( − )2 ln p m x, I ln p x x, I 2 m x 2 dm m=x h ni n − 1 = ln G − ln G 2 2 1 h i n − ln p(Dx)2 − (m − x)2 2 2(Dx)2 32 The Laplace approximation