A brief overview of probability theory in data science
Geert Verdoolaege
1 Department of Applied Physics, Ghent University, Ghent, Belgium
2 Laboratory for Plasma Physics, Royal Military Academy (LPP–ERM/KMS), Brussels, Belgium
Tutorial, 3rd IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 27-05-2019
Overview
1 Origins of probability
2 Frequentist methods and statistics
3 Principles of Bayesian probability theory
4 Monte Carlo computational methods
5 Applications: Classification, Regression analysis
6 Conclusions and references
3 Early history of probability
Earliest traces in Western civilization: Jewish writings, Aristotle
Notion of probability in law, based on evidence
Usage in finance
Usage and demonstration in gambling
4 Middle Ages
World is knowable, but uncertainty arises from human ignorance
William of Ockham: Ockham’s razor
Probabilis: a supposedly ‘provable’ opinion
Counting of authorities
Later: degree of truth, a scale
Quantification:
Law, faith → Bayesian notion
Gaming → frequentist notion
5 Quantification
17th century: Pascal, Fermat, Huygens
Comparative testing of hypotheses
Population statistics
1713: Ars Conjectandi by Jacob Bernoulli:
Weak law of large numbers
Principle of indifference
De Moivre (1718): The Doctrine of Chances
6 Bayes and Laplace
Paper by Thomas Bayes (1763): inversion of the binomial distribution
Pierre Simon Laplace: practical applications in the physical sciences
1812: Théorie Analytique des Probabilités
7 Frequentism and sampling theory
George Boole (1854), John Venn (1866)
Sampling from a ‘population’
Notion of ‘ensembles’
Andrey Kolmogorov (1930s): axioms of probability
Early 20th century classical statistics: Ronald Fisher, Karl Pearson, Jerzy Neyman, Egon Pearson
Applications in social sciences, medicine, natural sciences
8 Maximum entropy and Bayesian methods
Statistical mechanics with Ludwig Boltzmann (1868): maximum entropy energy distribution
Josiah Willard Gibbs (1874–1878)
Claude Shannon (1948): maximum entropy in frequentist language
Edwin Thompson Jaynes: Bayesian re-interpretation
Harold Jeffreys (1939): standard work on (Bayesian) probability
Richard T. Cox (1946): derivation of probability laws
Computers: flowering of Bayesian methods
Probability = frequency
Straightforward:
Number of 1s in 60 dice throws ≈ 10: p = 1/6
Probability of plasma disruption p ≈ Ndisr/Ntot
Less straightforward:
Probability of fusion electricity by 2050?
Probability that the mass of Saturn satisfies 90 M⊕ ≤ mS < 100 M⊕ (Earth masses)?
11 Populations vs. sample
12 Statistics
E.g. weight w of Belgian men: unknown but fixed for every individual
Average weight µw in population?
Random variable W
Sample: W1, W2,..., Wn
Average weight: statistic (estimator) W̄
Central limit theorem:
Wi ∼ p(W|µw, σw) ⇒ W̄ ∼ N(µw, σw/√n)
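The sampling behaviour of W̄ can be checked numerically; a minimal sketch, where the population parameters and sample sizes are arbitrary illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_w, sigma_w, n = 80.0, 10.0, 100  # hypothetical population mean/std, sample size

# Draw many samples of size n and compute each sample mean
means = rng.normal(mu_w, sigma_w, size=(10000, n)).mean(axis=1)

# Central limit theorem: W-bar is approximately N(mu_w, sigma_w/sqrt(n))
print(means.mean())  # close to mu_w
print(means.std())   # close to sigma_w / sqrt(n) = 1.0
```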
13 Maximum likelihood parameter estimation
Maximum likelihood (ML) principle:
µ̂w = arg max_{µw∈ℝ} p(W1, …, Wn|µw, σw)
= arg max_{µw∈ℝ} ∏_{i=1}^n 1/(√(2π)σw) exp[−(Wi − µw)²/(2σw²)]
= arg max_{µw∈ℝ} (√(2π)σw)^{−n} exp[−∑_{i=1}^n (Wi − µw)²/(2σw²)]
ML estimator (known σw):
µ̂w = W̄ = (1/n) ∑_{i=1}^n Wi
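That the ML estimate coincides with the sample mean is easy to verify numerically; a small sketch on simulated data, where all numbers (true mean, σw, sample size, grid) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(78.0, 12.0, size=500)  # simulated weights (invented parameters)

# Log-likelihood of a normal model with known sigma, up to an additive constant
def log_lik(mu, data, sigma=12.0):
    return -np.sum((data - mu) ** 2) / (2 * sigma ** 2)

# Brute-force maximization on a grid: the maximum sits at the sample mean
grid = np.linspace(70, 90, 2001)
mu_ml = grid[np.argmax([log_lik(m, w) for m in grid])]
print(mu_ml, w.mean())  # the two agree up to the grid spacing
```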
14 Frequentist hypothesis testing
Weight of Dutch men compared to Belgian men (populations)
Observed sample averages W̄NL, W̄BE
Null hypothesis H0: µw,NL = µw,BE
Test statistic:
(W̄NL − W̄BE) / σ_{W̄NL−W̄BE} ∼ N(0, 1) (under H0)
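The test above can be sketched as a two-sample z-test; the sample sizes, means, and known σ below are invented for illustration:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
w_nl = rng.normal(81.0, 10.0, size=400)  # hypothetical Dutch sample
w_be = rng.normal(79.0, 10.0, size=400)  # hypothetical Belgian sample
sigma = 10.0                             # population std, assumed known

# Standard error of the difference of sample means under H0
se = sigma * np.sqrt(1 / len(w_nl) + 1 / len(w_be))
z = (w_nl.mean() - w_be.mean()) / se

# Two-sided p-value from the standard normal CDF
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
print(z, p_value)
```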
16 Probability theory: quantifying uncertainty
Statistical (aleatoric) vs. epistemic uncertainty?
Every piece of information has uncertainty
Uncertainty = lack of information
Observation may reduce uncertainty
Probability (distribution) quantifies uncertainty
17 Example: physical sciences
Measurement of physical quantity
Origin of stochasticity:
Apparatus
Microscopic fluctuations
Systematic uncertainty is assigned a probability distribution
E.g. coin tossing, voltage measurement, probability of hypothesis vs. another, . . .
Bayesian: no random variables
18 What is probability?
Objective Bayesian view
Probability = real number ∈ [0, 1]
Always conditioned on known information
Notation: p(A|B) or p(A|I)
Extension of logic: measure of degree to which B implies A
Degree of plausibility, but subject to consistency
Same information ⇒ same probabilities
Probability distribution: outcome → probability
19 Joint, marginal and conditional distributions
p(x, y), p(x), p(y), p(x|y), p(y|x)
20 Example: normal distribution
Normal/Gaussian probability density function (PDF):
p(x|µ, σ, I) = 1/(√(2π)σ) exp[−(x − µ)²/(2σ²)]
p(x1|µ, σ, I) dx = probability that x1 ≤ x < x1 + dx
Inverse problem: µ, σ given x?
21 Updating information states
Bayes’ theorem:
p(θ|x, I) = p(x|θ, I) p(θ|I) / p(x|I)
x = data vector, θ = vector of model parameters, I = implicit knowledge
Likelihood: misfit between model and data
Prior distribution: ‘expert’ or diffuse knowledge
Evidence:
p(x|I) = ∫ p(x, θ|I) dθ = ∫ p(x|θ, I) p(θ|I) dθ
Posterior distribution
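The update rule becomes concrete on a discretized parameter space; a sketch for a coin-flip likelihood with a uniform prior, where the grid and the data (7 heads in 10 tosses) are invented for illustration:

```python
import numpy as np

# Discrete Bayes' theorem: posterior ∝ likelihood × prior
theta = np.linspace(0.01, 0.99, 99)              # head probability grid
prior = np.ones_like(theta) / theta.size         # uniform prior

heads, tosses = 7, 10                            # hypothetical data
likelihood = theta**heads * (1 - theta)**(tosses - heads)

evidence = np.sum(likelihood * prior)            # p(x|I), the normalization
posterior = likelihood * prior / evidence

print(theta[np.argmax(posterior)])               # posterior mode near 7/10
```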
22 Practical considerations
Updating state of knowledge = learning
‘Uninformative’ prior (distribution)
Assigning uninformative priors:
Transformation invariance (Jaynes ’68): e.g. uniform prior for a location variable (‘Show me the prior!’)
Principle of indifference
Testable information: maximum entropy
...
23 Mean of a normal distribution: uniform prior
n measurements xi → x
Independent and identically distributed xi:
p(x|µ, σ, I) = ∏_{i=1}^n 1/(√(2π)σ) exp[−(xi − µ)²/(2σ²)]
Bayes’ rule: p(µ, σ|x, I) ∝ p(x|µ, σ, I) p(µ, σ|I)
Suppose σ ≡ σe → delta-function prior
Assume µ ∈ [µmin, µmax] → uniform prior:
p(µ|I) = 1/(µmax − µmin) if µ ∈ [µmin, µmax], 0 otherwise
Let µmin → −∞, µmax → +∞ ⇒ improper prior; ensure a proper posterior
24 Posterior for µ
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)]
Define
x̄ ≡ (1/n) ∑_{i=1}^n xi, (∆x)² ≡ (1/n) ∑_{i=1}^n (xi − x̄)²
Completing the square,
p(µ|x, I) ∝ exp{−[(µ − x̄)² + (∆x)²] / (2σe²/n)}
Retaining the dependence on µ,
p(µ|x, I) ∝ exp[−(µ − x̄)² / (2σe²/n)]
µ ∼ N(x̄, σe²/n)
25 Mean of a normal distribution: normal prior
Normal prior: µ ∼ N(µ0, τ²)
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)] × exp[−(µ − µ0)²/(2τ²)]
Expanding and completing the square,
µ ∼ N(µn, σn²),
where
1/σn² ≡ n/σe² + 1/τ² and µn ≡ σn² (n x̄/σe² + µ0/τ²)
µn is weighted average of µ0 and x
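The conjugate update above is a two-line computation; a sketch on simulated data, where the true mean, σe, sample size, and prior parameters (µ0, τ) are all arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_e, mu_true = 2.0, 5.0          # known noise std, invented true mean
x = rng.normal(mu_true, sigma_e, size=50)
n, xbar = x.size, x.mean()

mu0, tau = 0.0, 10.0                 # normal prior N(mu0, tau^2), illustrative

# Conjugate update: 1/sigma_n^2 = n/sigma_e^2 + 1/tau^2
prec_n = n / sigma_e**2 + 1 / tau**2
sigma_n2 = 1 / prec_n
# mu_n = sigma_n^2 (n xbar / sigma_e^2 + mu0 / tau^2): precision-weighted average
mu_n = sigma_n2 * (n * xbar / sigma_e**2 + mu0 / tau**2)

print(mu_n, xbar)  # with this diffuse prior, mu_n is very close to xbar
```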
26 Care for the prior
Intrinsic part of Bayesian theory, enabling coherent learning
High-quality data may be scarce
Regularization of ill-posed problems (e.g. tomography)
Prior can retain influence: e.g. regression with errors in all variables
27 Unknown mean and standard deviation
Repeated measurements → information on σ
Scale variable σ → Jeffreys’ scale prior:
p(σ|I) ∝ 1/σ, σ ∈ ]0, +∞[
Posterior:
p(µ, σ|x, I) ∝ (1/σⁿ) exp{−[(µ − x̄)² + (∆x)²] / (2σ²/n)} × (1/σ)
28 Marginal posterior for µ (1)
Marginalization = integrating out a (nuisance) parameter:
p(µ|x, I) = ∫₀^{+∞} p(µ, σ|x, I) dσ
∝ {[(µ − x̄)² + (∆x)²] / (2/n)}^{−n/2} ∫₀^{+∞} s^{n/2−1} e^{−s} ds
= Γ(n/2) {[(µ − x̄)² + (∆x)²] / (2/n)}^{−n/2}
where
s ≡ [(µ − x̄)² + (∆x)²] / (2σ²/n)
29 Marginal posterior for µ (2)
After normalization:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}
Changing variables,
t ≡ (µ − x̄) / √((∆x)²/(n − 1)), with p(t|x, I) dt ≡ p(µ|x, I) dµ,
p(t|x, I) = Γ(n/2) / [√((n − 1)π) Γ((n − 1)/2)] × [1 + t²/(n − 1)]^{−n/2}
Student’s t-distribution with parameter ν = n − 1
If n ≫ 1,
p(µ|x, I) → 1/√(2π(∆x)²/n) exp[−(µ − x̄)² / (2(∆x)²/n)]
30 Marginal posterior for σ
Marginalization of µ:
p(σ|x, I) = ∫_{−∞}^{+∞} p(µ, σ|x, I) dµ ∝ (1/σⁿ) exp[−(∆x)² / (2σ²/n)]
Setting X ≡ n(∆x)2/σ2,
p(X|x, I) = 1/(2^{k/2} Γ(k/2)) X^{k/2−1} e^{−X/2}, k ≡ n − 1
χ2 distribution with parameter k
31 The Laplace approximation (1)
Laplace (saddle point) approximation of distributions around the mode (= maximum)
E.g. marginal for µ:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^{−n/2}
Taylor expansion around the mode:
ln p(µ|x, I) ≈ ln p(x̄|x, I) + (1/2) [d²(ln p)/dµ²]_{µ=x̄} (µ − x̄)²
= ln Γ(n/2) − ln Γ((n−1)/2) − (1/2) ln[π(∆x)²] − n(µ − x̄)² / (2(∆x)²)
32 The Laplace approximation (2)
On the original scale:
p(µ|x, I) ≈ Γ(n/2) / [Γ((n−1)/2) √(π(∆x)²)] × exp[−(µ − x̄)² / (2(∆x)²/n)]
Standard deviation σL → curvature of ln p:
σL = {−[d²(ln p)/dµ²]_{µ=x̄}}^{−1/2}
33 Laplace approximation: example 1
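As a stand-alone illustration (not the slide's figure), the curvature formula can be applied numerically to a toy log-density with a known mode; the Gamma-like target below is an arbitrary choice:

```python
import numpy as np

# Toy unnormalized log-density: ln p = 4 ln x - x, mode at x = 4 (illustrative)
def log_p(x):
    return 4 * np.log(x) - x

mode = 4.0
h = 1e-4
# Numerical second derivative of ln p at the mode (central difference)
d2 = (log_p(mode + h) - 2 * log_p(mode) + log_p(mode - h)) / h**2
# sigma_L = [-d^2(ln p)/dmu^2]^{-1/2}, as on the slide
sigma_L = (-d2) ** -0.5

# Exact curvature is -4/x^2 = -1/4 at x = 4, so sigma_L = 2
print(sigma_L)
```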
34 Laplace approximation: example 2
35 Multivariate Laplace approximation
For θ = [θ1, …, θp]^t, with θ0 the mode,
p(θ|x, I) ∝ exp{(1/2) (θ − θ0)^t [∇∇(ln p)]_{θ=θ0} (θ − θ0)}
∇∇(ln p): Hessian matrix, and
ΣL = −{[∇∇(ln p)]_{θ=θ0}}^{−1}
36 Model comparison (hypothesis testing)
Let {Hi} be a complete set of hypotheses
Data D to support or reject the hypotheses
Bayes’ rule:
p(Hi|D, I) = p(D|Hi, I) p(Hi|I) / p(D|I), p(D|I) = ∑_i p(D|Hi, I) p(Hi|I)
Assume a single hypothesis H and its complement H̄
Odds ratio o:
o ≡ p(H|D, I)/p(H̄|D, I) = [p(D|H, I)/p(D|H̄, I)] × [p(H|I)/p(H̄|I)]
(Bayes factor × prior odds)
37 Testing a Gaussian mean (1)
E.g. n measurements xi of a quantity x
Assume a normal distribution with known variance σ²
Question: are the data compatible with mean µ = µ0? Yes: H; No: H̄
Under H:
p(x|H, I) = C exp[−(x̄ − µ0)² / (2σ²/n)]
Under H̄:
p(x|H̄, I) = ∫ p(x|µ, σ, I) p(µ|H̄, I) dµ   (1)
38 Testing a Gaussian mean (2)
Assume bounds µmin and µmax:
p(µ|H̄, I) = 1/|µmax − µmin| for µmin ≤ µ ≤ µmax (0 otherwise)
Then (1) becomes
p(x|H̄, I) = C/|µmax − µmin| ∫_{µmin}^{µmax} exp[−(x̄ − µ)² / (2σ²/n)] dµ
With SE ≡ σ/√n (the standard error), assume
|x̄ − µmin|, |x̄ − µmax| ≫ SE
39 Testing a Gaussian mean (3)
Then
p(x|H̄, I) ≈ C/|µmax − µmin| ∫_{−∞}^{+∞} exp[−(x̄ − µ)² / (2σ²/n)] dµ = C SE √(2π) / |µmax − µmin|
Bayes factor BF:
BF = p(x|H, I) / p(x|H̄, I) = |µmax − µmin| / (SE √(2π)) × e^{−z²/2}, z ≡ |x̄ − µ0| / SE
Cf. frequentist hypothesis test
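The Bayes factor above is directly computable; a sketch in which the data, the known σ, and the prior bounds µmin, µmax are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
mu0, sigma, n = 0.0, 1.0, 100
x = rng.normal(mu0, sigma, size=n)   # data actually generated under H

se = sigma / np.sqrt(n)              # standard error
z = abs(x.mean() - mu0) / se

# Bayes factor from the slide; the prior bounds on mu are illustrative
mu_min, mu_max = -10.0, 10.0
bf = (mu_max - mu_min) / (se * np.sqrt(2 * np.pi)) * np.exp(-0.5 * z**2)

print(bf)  # data generated under H will usually give BF well above 1
```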
41 Monte Carlo sampling
(Strong) nonlinearities through prior or likelihood
Skewed, heavy-tailed, multimodal distributions
Sampling (simulating) from distributions p
Estimate properties of p
Marginalization: sampling from joint distribution → sampling from marginals
42 Markov chain Monte Carlo (MCMC) (1)
Metropolis-Hastings sampling, Gibbs sampling
Simple algorithms, but implementation issues might arise
Target distribution π(θ|I)
Markov chain θ(t): conditional on last state
Sample from proposal distribution
43 Markov chain Monte Carlo (2)
Subchains sample from the corresponding marginals
Monte Carlo averages, e.g.
θ̄j = (1/n) ∑_{i=1}^n θj^{(tc+i)}, (∆θj)² = (1/n) ∑_{i=1}^n (θj^{(tc+i)} − θ̄j)²
44 Metropolis-Hastings algorithm (1)
1 At t = 0, start from θ^(0) = [θ1^(0), θ2^(0), …, θp^(0)]^t
2 repeat
3 for j = 1 to p
y ∼ qj(η | θ1^(t+1), …, θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t), …, θp^(t), I)
θj^(t+1) = y with probability ρ, θj^(t) with probability 1 − ρ
4 end
45 Metropolis-Hastings algorithm (2)
Acceptance probability:
ρ(y, θ1^(t+1), …, θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t), …, θp^(t))
≡ min{1, [π(θ1^(t+1), …, θ_{j−1}^(t+1), y, θ_{j+1}^(t), …, θp^(t) | I) / π(θ1^(t+1), …, θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t), …, θp^(t) | I)]
× [qj(θj^(t) | θ1^(t+1), …, θ_{j−1}^(t+1), y, θ_{j+1}^(t), …, θp^(t), I) / qj(y | θ1^(t+1), …, θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t), …, θp^(t), I)]}
Example:
qj(η | θj^(t), I) ∝ exp[−(η − θj^(t))² / (2σj²)]
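For a single parameter and a symmetric Gaussian proposal (so the q-ratio in the acceptance probability cancels), the scheme reduces to random-walk Metropolis; a minimal sketch, with a standard normal as a stand-in target and all tuning constants chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(5)

# Target: standard normal, log-density up to an additive constant
def log_target(theta):
    return -0.5 * theta**2

def metropolis(n_samples, step=1.0, theta0=0.0):
    chain = np.empty(n_samples)
    theta = theta0
    for t in range(n_samples):
        prop = theta + step * rng.standard_normal()  # symmetric proposal
        # Accept with probability min(1, pi(prop)/pi(theta))
        if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
            theta = prop
        chain[t] = theta
    return chain

chain = metropolis(20000)
burn = chain[2000:]                 # discard burn-in
print(burn.mean(), burn.std())      # close to 0 and 1 for the N(0,1) target
```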
46 MCMC example
49 Classification or clustering
50 Simple Bayesian classification
M clusters of in total n data points xi in P-dimensional space
Known class labels ωj (j = 1, . . . , M) of xi
Bayes’ rule for new point x:
p(ωj|x, I) = p(x|ωj, I) p(ωj|I) / p(x|I)
Maximum a posteriori (MAP) classification rule for x:
Assign x to ωi = arg max_{ωj} p(ωj|x, I) = arg max_{ωj} p(x|ωj, I) p(ωj|I)
51 Examples of priors and likelihoods
Examples of prior probabilities (indifference):
p(ωi|I) = p(ωj|I), ∀i, j
Count class membership: p(ωi|I) ≡ ni/n, i = 1, …, M
Examples of likelihoods:
Naive Bayesian classifier:
p(x|ωi) = ∏_{k=1}^P p(xk|ωi), i = 1, …, M
Multivariate Gaussian:
p(x|ωj, I) = 1/[(2π)^{P/2} |Σj|^{1/2}] exp[−(1/2)(x − µj)^t Σj^{−1} (x − µj)]
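A MAP classifier with Gaussian likelihoods and equal priors can be sketched directly; the two synthetic clusters and their parameters are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# Two synthetic Gaussian classes (illustrative means and shared covariance)
mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 3.0])
cov = np.eye(2)
X1 = rng.multivariate_normal(mu1, cov, size=200)
X2 = rng.multivariate_normal(mu2, cov, size=200)

# Multivariate Gaussian log-likelihood
def log_gauss(x, mu, cov):
    d = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

# MAP rule with equal priors: pick the class with the largest likelihood,
# using means/covariances estimated from the training samples
def classify(x):
    s1 = log_gauss(x, X1.mean(0), np.cov(X1.T))
    s2 = log_gauss(x, X2.mean(0), np.cov(X2.T))
    return 1 if s1 > s2 else 2

print(classify(np.array([0.2, -0.1])), classify(np.array([2.9, 3.2])))
```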
52 Optimality of Bayesian classifier
Bayesian classifier minimizes probability of misclassification
54 MAP classification: decision surfaces
MAP: maximize w.r.t. ωj:
ln p(ωj|x, I) = ln p(x|ωj, I) + ln p(ωj|I)
Define (M = 2):
g(x) ≡ ln p(ω1|x, I) − ln p(ω2|x, I)
For a normal likelihood:
g(x) = (1/2)(x^t Σ2^{−1} x − x^t Σ1^{−1} x)   [quadratic]
+ µ1^t Σ1^{−1} x − µ2^t Σ2^{−1} x   [linear]
− (1/2) µ1^t Σ1^{−1} µ1 + (1/2) µ2^t Σ2^{−1} µ2 + (1/2) ln(|Σ2|/|Σ1|) + ln[p(ω1|I)/p(ω2|I)]   [constant]
g(x) = 0 separates the classes: decision hypersurface
55 Discriminant analysis
56 QDA and LDA
Quadratic discriminant analysis (QDA)
Linear discriminant analysis (LDA): Σ1 = Σ2
57 ELM type classification
Pinput − 1.41ΓD2 = 7.47
A. Shabbir et al., Fusion Eng. Des. 123, 717, 2017
58 Minimum distance classifiers (1)
Linear classifier with normal likelihood (Σ1 = Σ2 ≡ Σ):
g(x) = θ^t (x − x0) = 0,
θ ≡ Σ^{−1}(µ1 − µ2)
x0 ≡ (1/2)(µ1 + µ2) − ln[p(ω1|I)/p(ω2|I)] × (µ1 − µ2) / ‖µ1 − µ2‖²_{Σ^{−1}}
‖µ1 − µ2‖_{Σ^{−1}} ≡ √[(µ1 − µ2)^t Σ^{−1} (µ1 − µ2)]   (Mahalanobis distance)
p(ω1|I) = p(ω2|I) ⇒ minimum Mahalanobis distance classifier (to the cluster centroid)
Σ = σ²I ⇒ minimum Euclidean distance classifier
k-nearest neighbor classification (kNN)
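A minimum Mahalanobis distance classifier takes only a few lines; the class means and shared covariance below are illustrative choices:

```python
import numpy as np

# Equal priors, shared covariance: assign to the nearest centroid in
# Mahalanobis distance (illustrative parameters)
mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu):
    d = x - mu
    return np.sqrt(d @ Sigma_inv @ d)

def classify(x):
    return 1 if mahalanobis(x, mu1) < mahalanobis(x, mu2) else 2

print(classify(np.array([0.5, 0.2])), classify(np.array([3.5, -0.2])))
```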
59 Minimum distance classifiers (2)
60 Overview
1 Origins of probability
2 Frequentist methods and statistics
3 Principles of Bayesian probability theory
4 Monte Carlo computational methods
5 Applications Classification Regression analysis
6 Conclusions and references
61 Multilinear regression model
Regression model:
y = β0 + β1x1 + β2x2 + … + βpxp + e, e ∼ N(0, σ²), σ known
Negligible uncertainty on xj
Likelihood:
p(y|x, β, σ, I) = 1/(√(2π)σ) exp{−(1/(2σ²)) [y − β0 − ∑_{j=1}^p βjxj]²},
β ≡ [β0, βp^t]^t, βp ≡ [β1, …, βp]^t
62 Maximum likelihood solution
Take n measurements:
y ≡ [y1, …, yn]^t, X ≡ n × (p+1) design matrix with rows [1, xi1, …, xip]
Conditional independence:
p(y|X, β, σ, I) = (2π)^{−n/2} σ^{−n} exp[−(1/(2σ²)) (y − Xβ)^t (y − Xβ)]
ML: maximize w.r.t. β:
0 = ∇β (y − Xβ)^t (y − Xβ) = −2X^t y + 2X^t Xβ
⇒ βML = (X^t X)^{−1} X^t y = βLS, with (X^t X)^{−1} X^t the Moore-Penrose pseudoinverse
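The pseudoinverse solution can be checked against the normal equations on synthetic data; the design matrix, true coefficients, and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data for y = b0 + b1*x1 + b2*x2 + noise (illustrative coefficients)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0, 0.1, size=n)

# ML / least-squares solution via the Moore-Penrose pseudoinverse
beta_ml = np.linalg.pinv(X) @ y
# Equivalent normal-equations form: (X^t X)^{-1} X^t y
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

print(beta_ml)  # close to beta_true
```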
63 MAP solution and posterior
Uniform priors on βj (not the most uninformative!):
p(β|y, X, σ, I) ∝ exp[−(1/(2σ²)) (y − Xβ)^t (y − Xβ)]
Due to linearity and Gaussianity: βMAP = βML = βLS
Taylor expansion (exact!):
(y − Xβ)^t (y − Xβ) = (y − XβMAP)^t (y − XβMAP) + (1/2)(β − βMAP)^t (2X^t X)(β − βMAP)
Hence,
p(β|y, X, σ, I) = (2π)^{−(p+1)/2} |Σ|^{−1/2} exp[−(1/2)(β − βMAP)^t Σ^{−1} (β − βMAP)], Σ ≡ σ²(X^t X)^{−1}
64 Posterior predictive distribution (1)
New predictions by the model?
Posterior predictive distribution:
p(ynew|xnew, y, X, σ, I) = ∫_{ℝ^{p+1}} p(ynew, β|xnew, y, X, σ, I) dβ
= ∫_{ℝ^{p+1}} p(ynew|xnew, β, I) p(β|y, X, σ, I) dβ
But
p(ynew|xnew, β, I) = δ(ynew − β^t xnew)
Fix β0 = ynew − β1xnew,1 − ... − βpxnew,p
65 Posterior predictive distribution (2)
Marginalize over βp with flat priors:
p(ynew|xnew, y, X, σ, I) ∝ σ^{−n} ∫_{ℝ^p} exp{−(1/(2σ²)) ∑_{i=1}^n [yi − ynew + ∑_{j=1}^p βj(xnew,j − xij)]²} dβp
After (quite some) algebra, one finds simply
ynew,MAP = ∑_{j=1}^p xnew,j βMAP,j + βMAP,0
Simpler derivation based on properties of E and Var
General posterior more complicated!
67 Conclusions
Frequentist vs. Bayesian methods: interpretation of probability
Bayesian probability: extension of logic to situations with uncertainty
Posterior probability of parameters or hypotheses
Numerical approach in general
Underlies or explains many machine learning techniques
68 References
D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd edition, Oxford University Press, 2006
W. von der Linden, V. Dose and U. von Toussaint, Bayesian Probability Theory: Applications in the Physical Sciences, Cambridge University Press, 2014
P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2005
S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, Academic Press (Elsevier), 2015
E.T. Jaynes (G.L. Bretthorst, ed.), Probability Theory: The Logic of Science, Cambridge University Press, 2003
S.B. McGrayne, The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, Yale University Press, 2011