A brief overview of probability theory in data science
Geert Verdoolaege
1 Department of Applied Physics, Ghent University, Ghent, Belgium
2 Laboratory for Plasma Physics, Royal Military Academy (LPP–ERM/KMS), Brussels, Belgium
Tutorial, 3rd IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 27-05-2019
Overview
1 Origins of probability
2 Frequentist methods and statistics
3 Principles of Bayesian probability theory
4 Monte Carlo computational methods
5 Applications: Classification, Regression analysis
6 Conclusions and references
3 Early history of probability
Earliest traces in Western civilization: Jewish writings, Aristotle
Notion of probability in law, based on evidence
Usage in finance
Usage and demonstration in gambling
4 Middle Ages
World is knowable, but uncertainty arises from human ignorance
William of Ockham: Ockham’s razor
Probabilis: a supposedly ‘provable’ opinion
Counting of authorities
Later: degree of truth, a scale
Quantification:
Law, faith → Bayesian notion
Gaming → frequentist notion
5 Quantification
17th century: Pascal, Fermat, Huygens
Comparative testing of hypotheses
Population statistics
1713: Ars Conjectandi by Jacob Bernoulli:
Weak law of large numbers
Principle of indifference
De Moivre (1718): The Doctrine of Chances
6 Bayes and Laplace
Paper by Thomas Bayes (1763): inversion of the binomial distribution
Pierre Simon Laplace: practical applications in the physical sciences
1812: Théorie Analytique des Probabilités
7 Frequentism and sampling theory
George Boole (1854), John Venn (1866)
Sampling from a ‘population’
Notion of ‘ensembles’
Andrey Kolmogorov (1930s): axioms of probability
Early 20th century classical statistics: Ronald Fisher, Karl Pearson, Jerzy Neyman, Egon Pearson
Applications in social sciences, medicine, natural sciences
8 Maximum entropy and Bayesian methods
Statistical mechanics with Ludwig Boltzmann (1868): maximum entropy energy distribution
Josiah Willard Gibbs (1874–1878)
Claude Shannon (1948): maximum entropy in frequentist language
Edwin Thompson Jaynes: Bayesian re-interpretation
Harold Jeffreys (1939): standard work on (Bayesian) probability
Richard T. Cox (1946): derivation of probability laws
Computers: flowering of Bayesian methods
Probability = frequency
Straightforward:
Number of 1s in 60 dice throws ≈ 10: p = 1/6
Probability of plasma disruption p ≈ Ndisr/Ntot
Less straightforward:
Probability of fusion electricity by 2050?
Probability that the mass of Saturn satisfies 90 M⊕ ≤ mS < 100 M⊕ (Earth masses)?
11 Populations vs. sample
12 Statistics
E.g. weight w of Belgian men: unknown but fixed for every individual
Average weight µw in population?
Random variable W
Sample: W1, W2,..., Wn
Average weight: statistic (estimator) W̄
Central limit theorem:
Wi ∼ p(W|µw, σw) ⇒ W̄ ∼ N(µw, σw/√n)
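The sampling behaviour of W̄ can be checked numerically; a minimal sketch, where the population parameters and sample sizes are arbitrary illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_w, sigma_w, n = 80.0, 10.0, 100  # hypothetical population mean/std, sample size

# Draw many samples of size n and compute each sample mean
means = rng.normal(mu_w, sigma_w, size=(10000, n)).mean(axis=1)

# Central limit theorem: W-bar is approximately N(mu_w, sigma_w/sqrt(n))
print(means.mean())  # close to mu_w
print(means.std())   # close to sigma_w / sqrt(n) = 1.0
```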
13 Maximum likelihood parameter estimation
Maximum likelihood (ML) principle:
µ̂w = arg max_{µw∈ℝ} p(W1, …, Wn|µw, σw)
= arg max_{µw∈ℝ} ∏_{i=1}^n 1/(√(2π)σw) exp[−(Wi − µw)²/(2σw²)]
= arg max_{µw∈ℝ} (√(2π)σw)^{−n} exp[−∑_{i=1}^n (Wi − µw)²/(2σw²)]
ML estimator (known σw):
µ̂w = W̄ = (1/n) ∑_{i=1}^n Wi
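That the ML estimate coincides with the sample mean is easy to verify numerically; a small sketch on simulated data, where all numbers (true mean, σw, sample size, grid) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(78.0, 12.0, size=500)  # simulated weights (invented parameters)

# Log-likelihood of a normal model with known sigma, up to an additive constant
def log_lik(mu, data, sigma=12.0):
    return -np.sum((data - mu) ** 2) / (2 * sigma ** 2)

# Brute-force maximization on a grid: the maximum sits at the sample mean
grid = np.linspace(70, 90, 2001)
mu_ml = grid[np.argmax([log_lik(m, w) for m in grid])]
print(mu_ml, w.mean())  # the two agree up to the grid spacing
```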
14 Frequentist hypothesis testing
Weight of Dutch men compared to Belgian men (populations)
Observed sample averages W̄NL, W̄BE
Null hypothesis H0: µw,NL = µw,BE
Test statistic:
(W̄NL − W̄BE) / σ_{W̄NL−W̄BE} ∼ N(0, 1) (under H0)
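The test above can be sketched as a two-sample z-test; the sample sizes, means, and known σ below are invented for illustration:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
w_nl = rng.normal(81.0, 10.0, size=400)  # hypothetical Dutch sample
w_be = rng.normal(79.0, 10.0, size=400)  # hypothetical Belgian sample
sigma = 10.0                             # population std, assumed known

# Standard error of the difference of sample means under H0
se = sigma * np.sqrt(1 / len(w_nl) + 1 / len(w_be))
z = (w_nl.mean() - w_be.mean()) / se

# Two-sided p-value from the standard normal CDF
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(z, p_value)
```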
16 Probability theory: quantifying uncertainty
Statistical (aleatoric) vs. epistemic uncertainty?
Every piece of information has uncertainty
Uncertainty = lack of information
Observation may reduce uncertainty
Probability (distribution) quantifies uncertainty
17 Example: physical sciences
Measurement of physical quantity
Origin of stochasticity:
Apparatus
Microscopic fluctuations
Systematic uncertainty is assigned a probability distribution
E.g. coin tossing, voltage measurement, probability of hypothesis vs. another, . . .
Bayesian: no random variables
18 What is probability?
Objective Bayesian view
Probability = real number ∈ [0, 1]
Always conditioned on known information
Notation: p(A|B) or p(A|I)
Extension of logic: measure of degree to which B implies A
Degree of plausibility, but subject to consistency
Same information ⇒ same probabilities
Probability distribution: outcome → probability
19 Joint, marginal and conditional distributions
p(x, y), p(x), p(y), p(x|y), p(y|x)
20 Example: normal distribution
Normal/Gaussian probability density function (PDF):
p(x|µ, σ, I) = 1/(√(2π)σ) exp[−(x − µ)²/(2σ²)]
p(x1|µ, σ, I) dx = probability that x1 ≤ x < x1 + dx
Inverse problem: µ, σ given x?
21 Updating information states
Bayes’ theorem:
p(θ|x, I) = p(x|θ, I) p(θ|I) / p(x|I)
x = data vector, θ = vector of model parameters, I = implicit knowledge
Likelihood: misfit between model and data
Prior distribution: ‘expert’ or diffuse knowledge
Evidence:
p(x|I) = ∫ p(x, θ|I) dθ = ∫ p(x|θ, I) p(θ|I) dθ
Posterior distribution
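The update rule becomes concrete on a discretized parameter space; a sketch for a coin-flip likelihood with a uniform prior, where the grid and the data (7 heads in 10 tosses) are invented for illustration:

```python
import numpy as np

# Discrete Bayes' theorem: posterior ∝ likelihood × prior
theta = np.linspace(0.01, 0.99, 99)              # head probability grid
prior = np.ones_like(theta) / theta.size         # uniform prior

heads, tosses = 7, 10                            # hypothetical data
likelihood = theta**heads * (1 - theta)**(tosses - heads)

evidence = np.sum(likelihood * prior)            # p(x|I), the normalization
posterior = likelihood * prior / evidence

print(theta[np.argmax(posterior)])               # posterior mode near 7/10
```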
22 Practical considerations
Updating state of knowledge = learning
‘Uninformative’ prior (distribution)
Assigning uninformative priors:
Transformation invariance (Jaynes ’68): e.g. uniform prior for a location variable (‘Show me the prior!’)
Principle of indifference
Testable information: maximum entropy
...
23 Mean of a normal distribution: uniform prior
n measurements xi → x
Independent and identically distributed xi:
p(x|µ, σ, I) = ∏_{i=1}^n 1/(√(2π)σ) exp[−(xi − µ)²/(2σ²)]
Bayes’ rule: p(µ, σ|x, I) ∝ p(x|µ, σ, I) p(µ, σ|I)
Suppose σ ≡ σe → delta-function prior
Assume µ ∈ [µmin, µmax] → uniform prior:
p(µ|I) = 1/(µmax − µmin) if µ ∈ [µmin, µmax], 0 otherwise
Let µmin → −∞, µmax → +∞ ⇒ improper prior; ensure a proper posterior
24 Posterior for µ
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)]
Define
x̄ ≡ (1/n) ∑_{i=1}^n xi, (∆x)² ≡ (1/n) ∑_{i=1}^n (xi − x̄)²
Completing the square,
p(µ|x, I) ∝ exp{−[(µ − x̄)² + (∆x)²] / (2σe²/n)}
Retaining the dependence on µ,
p(µ|x, I) ∝ exp[−(µ − x̄)² / (2σe²/n)]
µ ∼ N(x̄, σe²/n)
25 Mean of a normal distribution: normal prior
Normal prior: µ ∼ N(µ0, τ²)
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)] × exp[−(µ − µ0)²/(2τ²)]
Expanding and completing the square,
µ ∼ N(µn, σn²),
where
1/σn² ≡ n/σe² + 1/τ² and µn ≡ σn² (n x̄/σe² + µ0/τ²)
µn is weighted average of µ0 and x
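The conjugate update above is a two-line computation; a sketch on simulated data, where the true mean, σe, sample size, and prior parameters (µ0, τ) are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_e, mu_true = 2.0, 5.0          # known noise std, invented true mean
x = rng.normal(mu_true, sigma_e, size=50)
n, xbar = x.size, x.mean()

mu0, tau = 0.0, 10.0                 # normal prior N(mu0, tau^2), illustrative

# Conjugate update: 1/sigma_n^2 = n/sigma_e^2 + 1/tau^2
prec_n = n / sigma_e**2 + 1 / tau**2
sigma_n2 = 1 / prec_n
# mu_n = sigma_n^2 (n xbar / sigma_e^2 + mu0 / tau^2): precision-weighted average
mu_n = sigma_n2 * (n * xbar / sigma_e**2 + mu0 / tau**2)

print(mu_n, xbar)  # with this diffuse prior, mu_n is very close to xbar
```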
26 Care for the prior
Intrinsic part of Bayesian theory, enabling coherent learning
High-quality data may be scarce
Regularization of ill-posed problems (e.g. tomography)
Prior can retain influence: e.g. regression with errors in all variables
27 Unknown mean and standard deviation
Repeated measurements → information on σ
Scale variable σ → Jeffreys’ scale prior:
p(σ|I) ∝ 1/σ, σ ∈ ]0, +∞[
Posterior:
p(µ, σ|x, I) ∝ (1/σⁿ) exp{−[(µ − x̄)² + (∆x)²] / (2σ²/n)} × (1/σ)
28 Marginal posterior for µ (1)
Marginalization = integrating out a (nuisance) parameter:
p(µ|x, I) = ∫₀^{+∞} p(µ, σ|x, I) dσ
∝ {[(µ − x̄)² + (∆x)²] / (2/n)}^{−n/2} ∫₀^{+∞} s^{n/2−1} e^{−s} ds
= Γ(n/2) {[(µ − x̄)² + (∆x)²] / (2/n)}^{−n/2}
where
s ≡ [(µ − x̄)² + (∆x)²] / (2σ²/n)
29 Marginal posterior for µ (2)
After normalization:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}
Changing variables,
t ≡ (µ − x̄) / √((∆x)²/(n − 1)), with p(t|x, I) dt ≡ p(µ|x, I) dµ,
p(t|x, I) = Γ(n/2) / [√((n − 1)π) Γ((n − 1)/2)] × [1 + t²/(n − 1)]^{−n/2}
Student’s t-distribution with parameter ν = n − 1
If n ≫ 1,
p(µ|x, I) → 1/√(2π(∆x)²/n) exp[−(µ − x̄)² / (2(∆x)²/n)]
30 Marginal posterior for σ
Marginalization of µ:
p(σ|x, I) = ∫_{−∞}^{+∞} p(µ, σ|x, I) dµ ∝ (1/σⁿ) exp[−(∆x)² / (2σ²/n)]
Setting X ≡ n(∆x)2/σ2,
p(X|x, I) = 1/(2^{k/2} Γ(k/2)) X^{k/2−1} e^{−X/2}, k ≡ n − 1
χ2 distribution with parameter k
31 The Laplace approximation (1)
Laplace (saddle point) approximation of distributions around the mode (= maximum)
E.g. marginal for µ:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}
Taylor expansion around the mode:
ln p(µ|x, I) ≈ ln p(x̄|x, I) + (1/2) [d²(ln p)/dµ²]_{µ=x̄} (µ − x̄)²
= ln Γ(n/2) − ln Γ((n−1)/2) − (1/2) ln[π(∆x)²] − n(µ − x̄)² / (2(∆x)²)
32 The Laplace approximation (2)
On the original scale:
p(µ|x, I) ≈ Γ(n/2) / [Γ((n−1)/2) √(π(∆x)²)] × exp[−(µ − x̄)² / (2(∆x)²/n)]
Standard deviation σL → curvature of ln p:
σL = {−[d²(ln p)/dµ²]_{µ=x̄}}^{−1/2}
33 Laplace approximation: example 1
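As a stand-alone illustration (not the slide's figure), the curvature formula can be applied numerically to a toy log-density with a known mode; the Gamma-like target below is an arbitrary choice:

```python
import numpy as np

# Toy unnormalized log-density: ln p = 4 ln x - x, mode at x = 4 (illustrative)
def log_p(x):
    return 4 * np.log(x) - x

mode = 4.0
h = 1e-4
# Numerical second derivative of ln p at the mode (central difference)
d2 = (log_p(mode + h) - 2 * log_p(mode) + log_p(mode - h)) / h**2
# sigma_L = [-d^2(ln p)/dmu^2]^{-1/2}, as on the slide
sigma_L = (-d2) ** -0.5

# Exact curvature is -4/x^2 = -1/4 at x = 4, so sigma_L = 2
print(sigma_L)
```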
34 Laplace approximation: example 2
35 Multivariate Laplace approximation
For θ = [θ1, …, θp]^t, with θ0 the mode,
p(θ|x, I) ∝ exp{(1/2) (θ − θ0)^t [∇∇(ln p)]_{θ=θ0} (θ − θ0)}
∇∇(ln p): Hessian matrix, and
ΣL = −{[∇∇(ln p)]_{θ=θ0}}^{−1}
36 Model comparison (hypothesis testing)
Let {Hi} be a complete set of hypotheses
Data D to support or reject the hypotheses
Bayes’ rule:
p(Hi|D, I) = p(D|Hi, I) p(Hi|I) / p(D|I), p(D|I) = ∑_i p(D|Hi, I) p(Hi|I)
Assume a single hypothesis H and its complement H̄
Odds ratio o:
o ≡ p(H|D, I)/p(H̄|D, I) = [p(D|H, I)/p(D|H̄, I)] × [p(H|I)/p(H̄|I)]
(Bayes factor × prior odds)
37 Testing a Gaussian mean (1)
E.g. n measurements xi of a quantity x
Assume a normal distribution with known variance σ²
Question: are the data compatible with mean µ = µ0? Yes: H; No: H̄
Under H:
p(x|H, I) = C exp[−(x̄ − µ0)² / (2σ²/n)]
Under H̄:
p(x|H̄, I) = ∫ p(x|µ, σ, I) p(µ|H̄, I) dµ   (1)
38 Testing a Gaussian mean (2)
Assume bounds µmin and µmax:
p(µ|H̄, I) = 1/|µmax − µmin| for µmin ≤ µ ≤ µmax (0 otherwise)
Then (1) becomes
p(x|H̄, I) = C/|µmax − µmin| ∫_{µmin}^{µmax} exp[−(x̄ − µ)² / (2σ²/n)] dµ
With SE ≡ σ/√n (the standard error), assume
|x̄ − µmin|, |x̄ − µmax| ≫ SE
39 Testing a Gaussian mean (3)
Then
p(x|H̄, I) ≈ C/|µmax − µmin| ∫_{−∞}^{+∞} exp[−(x̄ − µ)² / (2σ²/n)] dµ = C SE √(2π) / |µmax − µmin|
Bayes factor BF:
BF = p(x|H, I) / p(x|H̄, I) = |µmax − µmin| / (SE √(2π)) × e^{−z²/2}, z ≡ |x̄ − µ0| / SE
Cf. frequentist hypothesis test
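The Bayes factor above is directly computable; a sketch in which the data, the known σ, and the prior bounds µmin, µmax are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
mu0, sigma, n = 0.0, 1.0, 100
x = rng.normal(mu0, sigma, size=n)   # data actually generated under H

se = sigma / np.sqrt(n)              # standard error
z = abs(x.mean() - mu0) / se

# Bayes factor from the slide; the prior bounds on mu are illustrative
mu_min, mu_max = -10.0, 10.0
bf = (mu_max - mu_min) / (se * np.sqrt(2 * np.pi)) * np.exp(-0.5 * z**2)

print(bf)  # data generated under H will usually give BF well above 1
```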
41 Monte Carlo sampling
(Strong) nonlinearities through prior or likelihood
Skewed, heavy-tailed, multimodal distributions
Sampling (simulating) from distributions p
Estimate properties of p
Marginalization: sampling from joint distribution → sampling from marginals
42 Markov chain Monte Carlo (MCMC) (1)
Metropolis-Hastings sampling, Gibbs sampling
Simple algorithms, but implementation issues might arise
Target distribution π(θ|I)
Markov chain θ(t): conditional on last state
Sample from proposal distribution
43 Markov chain Monte Carlo (2)
Subchains sample from the corresponding marginals
Monte Carlo averages, e.g.
θ̄j = (1/n) ∑_{i=1}^n θj^{(tc+i)}, (∆θj)² = (1/n) ∑_{i=1}^n (θj^{(tc+i)} − θ̄j)²
44 Metropolis-Hastings algorithm (1)
1 At t = 0, start from θ^(0) = [θ1^(0), θ2^(0), …, θp^(0)]^t
2 repeat
3 for j = 1 to p
y ∼ qj(η | θ1^(t+1), …, θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t), …, θp^(t), I)
θj^(t+1) = y with probability ρ, θj^(t) with probability 1 − ρ
4 end
45 Metropolis-Hastings algorithm (2)
Acceptance probability:
ρ(y, θ1^(t+1), …, θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t), …, θp^(t))
≡ min{1, [π(θ1^(t+1), …, θ_{j−1}^(t+1), y, θ_{j+1}^(t), …, θp^(t) | I) / π(θ1^(t+1), …, θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t), …, θp^(t) | I)]
× [qj(θj^(t) | θ1^(t+1), …, θ_{j−1}^(t+1), y, θ_{j+1}^(t), …, θp^(t), I) / qj(y | θ1^(t+1), …, θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t), …, θp^(t), I)]}
Example:
qj(η | θj^(t), I) ∝ exp[−(η − θj^(t))² / (2σj²)]
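For a single parameter and a symmetric Gaussian proposal (so the q-ratio in the acceptance probability cancels), the scheme reduces to random-walk Metropolis; a minimal sketch, with a standard normal as a stand-in target and all tuning constants chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(5)

# Target: standard normal, log-density up to an additive constant
def log_target(theta):
    return -0.5 * theta**2

def metropolis(n_samples, step=1.0, theta0=0.0):
    chain = np.empty(n_samples)
    theta = theta0
    for t in range(n_samples):
        prop = theta + step * rng.standard_normal()  # symmetric proposal
        # Accept with probability min(1, pi(prop)/pi(theta))
        if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
            theta = prop
        chain[t] = theta
    return chain

chain = metropolis(20000)
burn = chain[2000:]                 # discard burn-in
print(burn.mean(), burn.std())      # close to 0 and 1 for the N(0,1) target
```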
46 MCMC example
49 Classification or clustering
50 Simple Bayesian classification
M clusters of in total n data points xi in P-dimensional space
Known class labels ωj (j = 1, . . . , M) of xi
Bayes’ rule for new point x:
p(ωj|x, I) = p(x|ωj, I) p(ωj|I) / p(x|I)
Maximum a posteriori (MAP) classification rule for x:
Assign x to ωi = arg max_{ωj} p(ωj|x, I) = arg max_{ωj} p(x|ωj, I) p(ωj|I)
51 Examples of priors and likelihoods
Examples of prior probabilities (indifference):
p(ωi|I) = p(ωj|I), ∀i, j
Count class membership: p(ωi|I) ≡ ni/n, i = 1, …, M
Examples of likelihoods:
Naive Bayesian classifier:
p(x|ωi) = ∏_{k=1}^P p(xk|ωi), i = 1, …, M
Multivariate Gaussian:
p(x|ωj, I) = 1/[(2π)^{P/2} |Σj|^{1/2}] exp[−(1/2)(x − µj)^t Σj^{−1} (x − µj)]
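A MAP classifier with Gaussian likelihoods and equal priors can be sketched directly; the two synthetic clusters and their parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Two synthetic Gaussian classes (illustrative means and shared covariance)
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
cov = np.eye(2)
X1 = rng.multivariate_normal(mu1, cov, size=200)
X2 = rng.multivariate_normal(mu2, cov, size=200)

# Multivariate Gaussian log-likelihood
def log_gauss(x, mu, cov):
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

# MAP rule with equal priors: pick the class with the largest likelihood,
# using means/covariances estimated from the training samples
def classify(x):
    s1 = log_gauss(x, X1.mean(0), np.cov(X1.T))
    s2 = log_gauss(x, X2.mean(0), np.cov(X2.T))
    return 1 if s1 > s2 else 2

print(classify(np.array([0.2, -0.1])), classify(np.array([2.9, 3.2])))
```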
52 Optimality of Bayesian classifier
Bayesian classifier minimizes probability of misclassification
54 MAP classification: decision surfaces
MAP: maximize w.r.t. ωj:
ln p(ωj|x, I) = ln p(x|ωj, I) + ln p(ωj|I)
Define (M = 2):
g(x) ≡ ln p(ω1|x, I) − ln p(ω2|x, I)
For a normal likelihood:
g(x) = (1/2)(x^t Σ2^{−1} x − x^t Σ1^{−1} x)   [quadratic]
+ µ1^t Σ1^{−1} x − µ2^t Σ2^{−1} x   [linear]
− (1/2) µ1^t Σ1^{−1} µ1 + (1/2) µ2^t Σ2^{−1} µ2 + (1/2) ln(|Σ2|/|Σ1|) + ln[p(ω1|I)/p(ω2|I)]   [constant]
g(x) = 0 separates the classes: decision hypersurface
55 Discriminant analysis
56 QDA and LDA
Quadratic discriminant analysis (QDA)
Linear discriminant analysis (LDA): Σ1 = Σ2
57 ELM type classification
Pinput − 1.41ΓD2 = 7.47
A. Shabbir et al., Fusion Eng. Des. 123, 717, 2017
58 Minimum distance classifiers (1)
Linear classifier with normal likelihood (Σ1 = Σ2 ≡ Σ):
g(x) = θ^t (x − x0) = 0,
θ ≡ Σ^{−1}(µ1 − µ2)
x0 ≡ (1/2)(µ1 + µ2) − ln[p(ω1|I)/p(ω2|I)] × (µ1 − µ2) / ‖µ1 − µ2‖²_{Σ^{−1}}
‖µ1 − µ2‖_{Σ^{−1}} ≡ √[(µ1 − µ2)^t Σ^{−1} (µ1 − µ2)]   (Mahalanobis distance)
p(ω1|I) = p(ω2|I) ⇒ minimum Mahalanobis distance classifier (to the cluster centroid)
Σ = σ²I ⇒ minimum Euclidean distance classifier
k-nearest neighbor classification (kNN)
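A minimum Mahalanobis distance classifier takes only a few lines; the class means and shared covariance below are illustrative choices:

```python
import numpy as np

# Equal priors, shared covariance: assign to the nearest centroid in
# Mahalanobis distance (illustrative parameters)
mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

def classify(x):
    return 1 if mahalanobis(x, mu1) < mahalanobis(x, mu2) else 2

print(classify(np.array([0.5, 0.2])), classify(np.array([3.5, -0.2])))
```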
59 Minimum distance classifiers (2)
60 Overview
1 Origins of probability
2 Frequentist methods and statistics
3 Principles of Bayesian probability theory
4 Monte Carlo computational methods
5 Applications Classification Regression analysis
6 Conclusions and references
61 Multilinear regression model
Regression model:
y = β0 + β1x1 + β2x2 + … + βpxp + e, e ∼ N(0, σ²), σ known
Negligible uncertainty on xj
Likelihood:
p(y|x, β, σ, I) = 1/(√(2π)σ) exp{−(1/(2σ²)) [y − β0 − ∑_{j=1}^p βjxj]²},
β ≡ [β0, βp^t]^t, βp ≡ [β1, …, βp]^t
62 Maximum likelihood solution
Take n measurements:
y ≡ [y1, …, yn]^t, X ≡ n × (p+1) design matrix with rows [1, xi1, …, xip]
Conditional independence:
p(y|X, β, σ, I) = (2π)^{−n/2} σ^{−n} exp[−(1/(2σ²)) (y − Xβ)^t (y − Xβ)]
ML: maximize w.r.t. β:
0 = ∇β (y − Xβ)^t (y − Xβ) = −2X^t y + 2X^t Xβ
⇒ βML = (X^t X)^{−1} X^t y = βLS, with (X^t X)^{−1} X^t the Moore-Penrose pseudoinverse
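The pseudoinverse solution can be checked against the normal equations on synthetic data; the design matrix, true coefficients, and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data for y = b0 + b1*x1 + b2*x2 + noise (illustrative coefficients)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0, 0.1, size=n)

# ML / least-squares solution via the Moore-Penrose pseudoinverse
beta_ml = np.linalg.pinv(X) @ y
# Equivalent normal-equations form: (X^t X)^{-1} X^t y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_ml)  # close to beta_true
```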
63 MAP solution and posterior
Uniform priors on βj (not the most uninformative!):
p(β|y, X, σ, I) ∝ exp[−(1/(2σ²)) (y − Xβ)^t (y − Xβ)]
Due to linearity and Gaussianity: βMAP = βML = βLS
Taylor expansion (exact!):
(y − Xβ)^t (y − Xβ) = (y − XβMAP)^t (y − XβMAP) + (1/2)(β − βMAP)^t (2X^t X)(β − βMAP)
Hence,
p(β|y, X, σ, I) = (2π)^{−(p+1)/2} |Σ|^{−1/2} exp[−(1/2)(β − βMAP)^t Σ^{−1} (β − βMAP)], Σ ≡ σ²(X^t X)^{−1}
64 Posterior predictive distribution (1)
New predictions by the model?
Posterior predictive distribution:
p(ynew|xnew, y, X, σ, I) = ∫_{ℝ^{p+1}} p(ynew, β|xnew, y, X, σ, I) dβ
= ∫_{ℝ^{p+1}} p(ynew|xnew, β, I) p(β|y, X, σ, I) dβ
But
p(ynew|xnew, β, I) = δ(ynew − β^t xnew)
Fix β0 = ynew − β1xnew,1 − ... − βpxnew,p
65 Posterior predictive distribution (2)
Marginalize over βp with flat priors:
p(ynew|xnew, y, X, σ, I) ∝ σ^{−n} ∫_{ℝ^p} exp{−(1/(2σ²)) ∑_{i=1}^n [yi − ynew + ∑_{j=1}^p βj(xnew,j − xij)]²} dβp
After (quite some) algebra, one finds simply
ynew,MAP = ∑_{j=1}^p xnew,j βMAP,j + βMAP,0
Simpler derivation based on properties of E and Var
General posterior more complicated!
67 Conclusions
Frequentist vs. Bayesian methods: interpretation of probability
Bayesian probability: extension of logic to situations with uncertainty
Posterior probability of parameters or hypotheses
Numerical approach in general
Underlies or explains many machine learning techniques
68 References
D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd edition, Oxford University Press, 2006
W. von der Linden, V. Dose and U. von Toussaint, Bayesian Probability Theory: Applications in the Physical Sciences, Cambridge University Press, 2014
P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2005
S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, Academic Press (Elsevier), 2015
E.T. Jaynes (G.L. Bretthorst, ed.), Probability Theory: The Logic of Science, Cambridge University Press, 2003
S.B. McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, Yale University Press, 2011