
A brief overview of inference

Geert Verdoolaege
1 Department of Applied Physics, Ghent University, Ghent, Belgium
2 Laboratory for Plasma Physics, Royal Military Academy (LPP–ERM/KMS), Brussels, Belgium
Tutorial, 3rd IAEA Technical Meeting on Fusion Data Processing, Validation and Analysis, 27-05-2019

Overview

1 Origins of probability

2 Frequentist methods and statistics

3 Principles of Bayesian probability theory

4 Monte Carlo computational methods

5 Applications: Classification, Regression analysis

6 Conclusions and references


3 Early history

Earliest traces in Western civilization: Jewish writings,

Notion of probability in law, based on evidence

Usage in finance

Usage and demonstration in

4 Middle Ages

World is knowable, but uncertain due to human ignorance

William of Ockham: Ockham’s razor

Probabilis: a supposedly ‘provable’ opinion

Counting of authorities

Later: degree of certainty, on a scale

Quantification:

Law → Bayesian notion

Gaming → frequentist notion

5 Quantification

17th century: Pascal, Fermat, Huygens

Comparative testing of hypotheses

Population statistics

1713: Ars Conjectandi by Jacob Bernoulli:

Weak law of large numbers

Principle of indifference

De Moivre (1718): The Doctrine of Chances

6 Bayes and Laplace

Paper by Thomas Bayes (1763): inversion of the binomial distribution
Pierre Simon Laplace: practical applications in the physical sciences
1812: Théorie Analytique des Probabilités

7 Frequentism and probability theory

George Boole (1854), John Venn (1866)

Sampling from a ‘population’

Notion of ‘ensembles’

Andrey Kolmogorov (1930s): axiomatization of probability

Early classical statistics: Karl Pearson, Ronald Fisher, Jerzy Neyman, Egon Pearson

Applications in social sciences, medicine, natural sciences

8 Maximum entropy and Bayesian methods

Statistical mechanics with Ludwig Boltzmann (1868): maximum entropy energy distribution

Josiah Willard Gibbs (1874–1878)

Claude Shannon (1948): maximum entropy in frequentist language

Edwin Thompson Jaynes: Bayesian re-interpretation

Harold Jeffreys (1939): standard work on (Bayesian) probability

Richard T. Cox (1946): derivation of probability laws

Computers: flowering of Bayesian methods


10 Probability = relative frequency

Straightforward:

Number of 1s in 60 throws ≈ 10: p = 1/6

Probability of plasma disruption p ≈ Ndisr/Ntot

Less straightforward:

Probability of fusion electricity by 2050?

Probability that the mass of Saturn satisfies 90 M⊕ ≤ mS < 100 M⊕?

11 Populations vs. sample

12 Statistics

E.g. weight w of Belgian men: unknown but fixed for every individual

Average weight µw in population?

Random variable W

Sample: W1, W2,..., Wn

Average weight: W̄ = (1/n) ∑_{i=1}^n Wi

Central limit theorem: Wi ∼ p(W|µw, σw) ⇒ W̄ ∼ N(µw, σw/√n)
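As a quick illustrative check (population and sample sizes hypothetical), the CLT behaviour of the sample average can be simulated even for a decidedly non-normal population:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: exponential "weights" (non-normal); all values illustrative
mu_w, n, n_samples = 80.0, 50, 20000
sigma_w = mu_w  # the std of an exponential equals its mean

# Draw many samples of size n and compute each sample average
W = rng.exponential(mu_w, size=(n_samples, n))
W_bar = W.mean(axis=1)

# CLT: the sample average is approximately N(mu_w, sigma_w^2 / n)
print(W_bar.mean(), W_bar.std())  # close to mu_w and sigma_w / sqrt(n)
```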

13 Maximum likelihood parameter estimation

Maximum likelihood (ML) principle:

µ̂w = arg max_{µw ∈ ℝ} p(W1,..., Wn|µw, σw)
   ≈ arg max_{µw ∈ ℝ} ∏_{i=1}^n [1/(√(2π)σw)] exp[−(Wi − µw)²/(2σw²)]
   = arg max_{µw ∈ ℝ} [1/(√(2π)σw)]ⁿ exp[−∑_{i=1}^n (Wi − µw)²/(2σw²)]

ML estimator (known σw):

µ̂w = W̄ = (1/n) ∑_{i=1}^n Wi
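A small sketch (sample values hypothetical) confirming numerically that maximizing the likelihood in µw recovers the sample average:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
sigma_w = 10.0                       # known standard deviation
W = rng.normal(75.0, sigma_w, 100)   # illustrative sample

# Negative log-likelihood in mu_w (constants dropped)
def nll(mu):
    return np.sum((W - mu) ** 2) / (2 * sigma_w ** 2)

mu_hat = minimize_scalar(nll).x

# The ML estimator coincides with the sample average
print(mu_hat, W.mean())
```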

14 Frequentist hypothesis testing

Weight of Dutch men compared to Belgian men (populations)
Observed sample averages W̄NL, W̄BE
Null hypothesis H0: µw,NL = µw,BE
Test statistic:
(W̄NL − W̄BE) / σ_{W̄NL − W̄BE} ∼ N(0, 1) (under H0)
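A minimal sketch of this two-sample z-test; the sample sizes, means and the assumed known σ are all hypothetical:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma = 8.0                            # assumed known population std
n_nl, n_be = 200, 200
W_nl = rng.normal(81.0, sigma, n_nl)   # hypothetical Dutch sample
W_be = rng.normal(79.0, sigma, n_be)   # hypothetical Belgian sample

# Standard error of the difference of the two sample averages
se_diff = sigma * np.sqrt(1 / n_nl + 1 / n_be)

# Test statistic, standard normal under H0: mu_NL = mu_BE
z = (W_nl.mean() - W_be.mean()) / se_diff
p_value = 2 * norm.sf(abs(z))          # two-sided p-value
print(z, p_value)
```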


16 Probability theory: quantifying uncertainty

Statistical (aleatoric) vs. epistemic uncertainty?

Every piece of information has uncertainty

Uncertainty = lack of information

Observation may reduce uncertainty

Probability (distribution) quantifies uncertainty

17 Example: physical sciences

Measurement of physical quantity

Origin of stochasticity:

Apparatus

Microscopic fluctuations

Systematic uncertainty is also assigned a probability distribution

E.g. coin tossing, voltage measurement, probability of hypothesis vs. another, . . .

Bayesian: no random variables

18 What is probability?

Objective Bayesian view

Probability = real number ∈ [0, 1]

Always conditioned on known information

Notation: p(A|B) or p(A|I)

Extension of logic: measure of the degree to which B implies A

Degree of plausibility, but subject to consistency

Same information ⇒ same probability assignment

Probability distribution: value of a variable → probability

19 Joint, marginal and conditional distributions

p(x, y), p(x), p(y), p(x|y), p(y|x)

20 Example: the normal distribution

Normal/Gaussian probability density function (PDF):
p(x|µ, σ, I) = 1/(√(2π)σ) exp[−(x − µ)²/(2σ²)]

Probability of x1 ≤ x < x1 + dx
Inverse problem: µ, σ given x?

21 Updating information states

Bayes’ theorem:
p(θ|x, I) = p(x|θ, I) p(θ|I) / p(x|I)
x = data vector, θ = vector of model parameters, I = implicit knowledge

Likelihood: misfit between model and data

Prior distribution: ‘expert’ or diffuse knowledge

Evidence:
p(x|I) = ∫ p(x, θ|I) dθ = ∫ p(x|θ, I) p(θ|I) dθ

Posterior distribution
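A minimal discrete illustration of this update rule, using two hypothetical hypotheses about a die (all numbers invented for illustration; the evidence is the sum over hypotheses):

```python
import numpy as np
from scipy.stats import binom

# Two hypotheses about a die: fair vs. loaded toward six (illustrative)
p_H = np.array([0.5, 0.5])         # prior p(theta|I)
p_six = np.array([1 / 6, 1 / 2])   # p(six per throw | theta, I)

# Data: 4 sixes in 6 throws; binomial likelihood p(x|theta, I)
likelihood = binom.pmf(4, 6, p_six)

# Bayes' theorem: posterior = likelihood x prior / evidence
evidence = np.sum(likelihood * p_H)   # p(x|I)
posterior = likelihood * p_H / evidence
print(posterior)
```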


22 Practical considerations

Updating state of knowledge = learning

‘Uninformative’ prior (distribution)

Assigning uninformative priors:

Principle of indifference
Transformation invariance (Jaynes ’68): e.g. uniform prior for a location variable
Show me the prior!

Testable information: maximum entropy

...

23 Mean of a normal distribution: uniform prior

n measurements xi → x
Independent and identically distributed xi:
p(x|µ, σ, I) = ∏_{i=1}^n 1/(√(2π)σ) exp[−(xi − µ)²/(2σ²)]
Bayes’ rule: p(µ, σ|x, I) ∝ p(x|µ, σ, I) p(µ, σ|I)

Suppose σ ≡ σe → delta function
Assume µ ∈ [µmin, µmax] → uniform prior:
p(µ|I) = 1/(µmax − µmin) if µ ∈ [µmin, µmax], 0 otherwise

Let µmin → −∞, µmax → +∞ ⇒ improper prior
Ensure proper posterior

24 Posterior for µ

Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)]
Define
x̄ ≡ (1/n) ∑_{i=1}^n xi,  (∆x)² ≡ (1/n) ∑_{i=1}^n (xi − x̄)²
Completing the square,
p(µ|x, I) ∝ exp{−[(µ − x̄)² + (∆x)²]/(2σe²/n)}
Retaining the dependence on µ,
p(µ|x, I) ∝ exp[−(µ − x̄)²/(2σe²/n)]
µ ∼ N(x̄, σe²/n)

25 Mean of a normal distribution: normal prior

Normal prior: µ ∼ N(µ0, τ²)
Posterior:
p(µ|x, I) ∝ exp[−∑_{i=1}^n (xi − µ)²/(2σe²)] × exp[−(µ − µ0)²/(2τ²)]

Expanding and completing the square,

µ ∼ N(µn, σn²),

where

σn² ≡ (n/σe² + 1/τ²)⁻¹ and µn ≡ σn² (n x̄/σe² + µ0/τ²)

µn is a weighted average of µ0 and x̄
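A short numeric sketch of this conjugate update; the prior parameters and data values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_e = 2.0             # known measurement std
mu0, tau = 10.0, 5.0      # prior mu ~ N(mu0, tau^2); numbers illustrative
x = rng.normal(12.0, sigma_e, 25)
n, xbar = x.size, x.mean()

# Conjugate update: posterior mu ~ N(mu_n, sigma_n^2)
sigma_n2 = 1.0 / (n / sigma_e ** 2 + 1.0 / tau ** 2)
mu_n = sigma_n2 * (n * xbar / sigma_e ** 2 + mu0 / tau ** 2)

# mu_n is a precision-weighted average of the prior mean and the sample mean
print(mu_n, np.sqrt(sigma_n2))
```

Note that the posterior variance is smaller than both the prior variance and the sampling variance of the mean, as the precisions add.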

26 Care for the prior

Intrinsic part of Bayesian theory, enabling coherent learning

High-quality data may be scarce

Regularization of ill-posed problems (e.g. tomography)

Prior can retain influence: e.g. regression with errors in all variables

27 Unknown mean and standard deviation

Repeated measurements → information on σ

Scale variable σ → Jeffreys’ scale prior:

p(σ|I) ∝ 1/σ,  σ ∈ ]0, +∞[

Posterior:
p(µ, σ|x, I) ∝ (1/σⁿ) exp{−[(µ − x̄)² + (∆x)²]/(2σ²/n)} × (1/σ)

28 Marginal posterior for µ (1)

Marginalization = integrating out a (nuisance) parameter:

p(µ|x, I) = ∫₀^∞ p(µ, σ|x, I) dσ
∝ [((µ − x̄)² + (∆x)²)/(2/n)]^(−n/2) ∫₀^∞ s^(n/2 − 1) e^(−s) ds
= Γ(n/2) [((µ − x̄)² + (∆x)²)/(2/n)]^(−n/2)

where

s ≡ [(µ − x̄)² + (∆x)²]/(2σ²/n)

29 Marginal posterior for µ (2)

After normalization:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^(−n/2)
Changing variables,
t ≡ (µ − x̄)/√((∆x)²/(n − 1)),  with p(t|x, I) dt ≡ p(µ|x, I) dµ,
p(t|x, I) = Γ(n/2) / [√((n − 1)π) Γ((n−1)/2)] × [1 + t²/(n − 1)]^(−n/2)
Student’s t-distribution with parameter ν = n − 1
If n ≫ 1,
p(µ|x, I) → 1/√(2π(∆x)²/n) exp[−(µ − x̄)²/(2(∆x)²/n)]

30 Marginal posterior for σ

Marginalization of µ:

p(σ|x, I) = ∫_{−∞}^{+∞} p(µ, σ|x, I) dµ
∝ (1/σⁿ) exp[−(∆x)²/(2σ²/n)]

Setting X ≡ n(∆x)2/σ2,

p(X|x, I) = 1/(2^(k/2) Γ(k/2)) X^(k/2 − 1) e^(−X/2),  k ≡ n − 1

χ2 distribution with parameter k
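A numeric sanity check (illustrative data, assuming scipy is available): evaluating the unnormalized marginal for σ on a grid and changing variables to X reproduces the χ² shape up to a constant factor:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 20)    # illustrative data
n = x.size
dx2 = np.mean((x - x.mean()) ** 2)   # (Delta x)^2

# Unnormalized marginal posterior for sigma on a grid
sigma = np.linspace(0.4, 3.0, 400)
p_sigma = sigma ** (-n) * np.exp(-n * dx2 / (2 * sigma ** 2))

# Change of variables X = n (Delta x)^2 / sigma^2, with Jacobian |dX/dsigma|
X = n * dx2 / sigma ** 2
jac = 2 * n * dx2 / sigma ** 3
p_X = p_sigma / jac

# p(X) should be proportional to a chi-squared pdf with k = n - 1
ratio = p_X / chi2.pdf(X, df=n - 1)
print(ratio.min(), ratio.max())   # nearly constant
```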

31 The Laplace approximation (1)

Laplace (saddle point) approximation of distributions around the mode (= maximum)
E.g. marginal for µ:
p(µ|x, I) = Γ(n/2) / [√(π(∆x)²) Γ((n−1)/2)] × [1 + (µ − x̄)²/(∆x)²]^(−n/2)
Taylor expansion around the mode:
ln p(µ|x, I) ≈ ln p(x̄|x, I) + (1/2) [d²(ln p)/dµ²]_{µ=x̄} (µ − x̄)²
= ln Γ(n/2) − ln Γ((n−1)/2) − (1/2) ln[π(∆x)²] − n(µ − x̄)²/(2(∆x)²)

32 The Laplace approximation (2)

On the original scale:
p(µ|x, I) ≈ Γ(n/2) / [Γ((n−1)/2) √(π(∆x)²)] × exp[−(µ − x̄)²/(2(∆x)²/n)]

Standard deviation σL from the curvature of ln p:

σL = {−[d²(ln p)/dµ²]_{µ=x̄}}^(−1/2)

33 Laplace approximation: example 1
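A small numeric sketch of the Laplace width (all numbers hypothetical): the curvature of ln p at the mode, estimated by finite differences, matches the analytic value √((∆x)²/n) from the previous slide:

```python
import numpy as np

# Marginal posterior for mu (Student-t form from the slides); numbers illustrative
n, xbar, dx2 = 10, 5.0, 4.0       # sample size, sample mean, (Delta x)^2

def log_p(mu):  # unnormalized log posterior
    return -(n / 2) * np.log(1 + (mu - xbar) ** 2 / dx2)

# Curvature of ln p at the mode, by central finite differences
h = 1e-4
d2 = (log_p(xbar + h) - 2 * log_p(xbar) + log_p(xbar - h)) / h ** 2
sigma_L = (-d2) ** -0.5

# Compare with the analytic width sqrt((Delta x)^2 / n)
print(sigma_L, np.sqrt(dx2 / n))
```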

34 Laplace approximation: example 2

35 Multivariate Laplace approximation

For θ = [θ1,..., θp]^t,
p(θ|x, I) ∝ exp{(1/2) (θ − θ0)^t [∇∇(ln p)]_{θ=θ0} (θ − θ0)}

∇∇(ln p): Hessian matrix, where

ΣL = −{[∇∇(ln p)]_{θ=θ0}}⁻¹

36 Model comparison (hypothesis testing)

Let {Hi} be a complete set of hypotheses
Data D to support or reject hypotheses
Bayes’ rule:

p(Hi|D, I) = p(D|Hi, I) p(Hi|I) / p(D|I),  p(D|I) = ∑ᵢ p(D|Hi, I) p(Hi|I)

Assume a single hypothesis H and its complement H̄
Odds ratio o:
o ≡ p(H|D, I)/p(H̄|D, I) = [p(D|H, I)/p(D|H̄, I)] × [p(H|I)/p(H̄|I)]
(Bayes factor × prior odds)

37 Testing a Gaussian mean (1)

E.g. n measurements xi of a quantity x Assume normal distribution with known σ2

Question: are the data compatible with mean µ = µ0?
Yes: H
No: H̄

Under H:
p(x|H, I) = C exp[−(x̄ − µ0)²/(2σ²/n)]

Under H̄:
p(x|H̄, I) = ∫ p(x|µ, σ, I) p(µ|H̄, I) dµ   (1)

38 Testing a Gaussian mean (2)

Assume bounds µmin and µmax:

p(µ|H̄, I) = [1/|µmax − µmin|] · 1(µmin ≤ µ ≤ µmax)

Then (1) becomes

Then (1) becomes

p(x|H̄, I) = C/|µmax − µmin| ∫_{µmin}^{µmax} exp[−(x̄ − µ)²/(2σ²/n)] dµ

With SE ≡ σ/√n, assume

|x̄ − µmin|, |x̄ − µmax| ≫ SE

39 Testing a Gaussian mean (3)

Then
p(x|H̄, I) ≈ C/|µmax − µmin| ∫_{−∞}^{+∞} exp[−(x̄ − µ)²/(2σ²/n)] dµ
= C SE √(2π) / |µmax − µmin|

Bayes factor BF:

BF = p(x|H, I)/p(x|H̄, I) = |µmax − µmin|/(SE √(2π)) · e^(−z²/2),  z ≡ |x̄ − µ0|/SE

Cf. frequentist hypothesis test
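A minimal sketch computing this Bayes factor for simulated data; the data parameters and prior bounds are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, n, mu0 = 1.0, 50, 0.0
x = rng.normal(0.1, sigma, n)   # data; true mean slightly off mu0
xbar = x.mean()

se = sigma / np.sqrt(n)         # SE = sigma / sqrt(n)
mu_min, mu_max = -10.0, 10.0    # prior bounds under H-bar (illustrative)

# Bayes factor in the wide-prior approximation from the slides
z = abs(xbar - mu0) / se
BF = (mu_max - mu_min) / (se * np.sqrt(2 * np.pi)) * np.exp(-z ** 2 / 2)
print(z, BF)   # BF > 1 favours H: mu = mu0
```

The prefactor |µmax − µmin|/(SE √(2π)) is an Occam factor: a wider prior under H̄ penalizes the more complex hypothesis.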


41 Monte Carlo sampling

(Strong) nonlinearities through prior or likelihood
Skewed, heavy-tailed, multimodal
Sampling (simulating) from distributions p
Estimate properties of p
Marginalization: sampling from joint distribution → sampling from marginals

42 Markov chain Monte Carlo (MCMC) (1)

Metropolis-Hastings sampling

Simple algorithms, but implementation issues might arise

Target distribution π(θ|I)

Markov chain θ^(t): each new state conditional only on the last state

Sample from proposal distribution

43 Markov chain Monte Carlo (MCMC) (2)

Subchains sample from the corresponding marginals
Monte Carlo averages, e.g.
θ̄j = (1/n) ∑_{i=1}^n θj^(tc+i),  (∆θj)² = (1/n) ∑_{i=1}^n [θj^(tc+i) − θ̄j]²

44 Metropolis-Hastings algorithm (1)

1 At t = 0, start from θ^(0) = [θ1^(0), θ2^(0),..., θp^(0)]^t

2 repeat

3 for j = 1 to p
  y ∼ qj(η | θ1^(t+1),..., θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t),..., θp^(t), I)
  θj^(t+1) = y with probability ρ, θj^(t) with probability 1 − ρ

4 end

45 Metropolis-Hastings algorithm (2)

Acceptance probability:

ρ(y, θ1^(t+1),..., θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t),..., θp^(t))
≡ min{1, [π(θ1^(t+1),..., θ_{j−1}^(t+1), y, θ_{j+1}^(t),..., θp^(t) | I) / π(θ1^(t+1),..., θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t),..., θp^(t) | I)]
× [qj(θj^(t) | θ1^(t+1),..., θ_{j−1}^(t+1), y, θ_{j+1}^(t),..., θp^(t), I) / qj(y | θ1^(t+1),..., θ_{j−1}^(t+1), θj^(t), θ_{j+1}^(t),..., θp^(t), I)]}

Example:
qj(η | θj^(t), I) ∝ exp[−(η − θj^(t))²/(2σj²)]
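A compact sketch of the component-wise Metropolis-Hastings scheme above, for an illustrative bivariate normal target; with the symmetric normal proposal the q-ratio in ρ cancels:

```python
import numpy as np

rng = np.random.default_rng(6)

# Target (unnormalized): a correlated bivariate normal, chosen for illustration
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
prec = np.linalg.inv(cov)

def log_pi(theta):
    return -0.5 * theta @ prec @ theta

# Component-wise Metropolis-Hastings with a symmetric normal proposal
n_iter, burn_in, step = 20000, 2000, 1.0
theta = np.zeros(2)
chain = np.empty((n_iter, 2))
for t in range(n_iter):
    for j in range(2):
        prop = theta.copy()
        prop[j] += step * rng.standard_normal()
        # rho = min(1, pi(prop)/pi(theta)); accept with probability rho
        if np.log(rng.random()) < log_pi(prop) - log_pi(theta):
            theta = prop
    chain[t] = theta
chain = chain[burn_in:]

# Monte Carlo averages over the subchains
print(chain.mean(axis=0))
print(np.cov(chain.T))
```

The estimated means and covariance should approach the target's (zero mean, off-diagonal 0.8) as the chain grows.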

46 MCMC example


49 Classification or clustering

50 Simple Bayesian classification

M classes, with in total n data points xi in P-dimensional space

Known class labels ωj (j = 1, . . . , M) of xi

Bayes’ rule for new point x:

p(ωj|x, I) = p(x|ωj, I) p(ωj|I) / p(x|I)

Maximum a posteriori (MAP) classification rule for x:

Assign x to ωi = arg max_{ωj} p(ωj|x, I) = arg max_{ωj} p(x|ωj, I) p(ωj|I)

51 Examples of priors and likelihoods

Examples of prior probabilities (indifference):

p(ωi|I) = p(ωj|I), ∀i, j
Count class membership:
p(ωi|I) ≡ ni/n,  i = 1,..., M
Examples of likelihoods:
Naive Bayesian classifier:

p(x|ωi) = ∏_{k=1}^P p(xk|ωi),  i = 1,..., M

Multivariate Gaussian:
p(x|ωj, I) = 1/[(2π)^(P/2) |Σj|^(1/2)] exp[−(1/2)(x − µj)^t Σj⁻¹ (x − µj)]
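A short sketch of MAP classification with Gaussian likelihoods and indifference priors; the class means, covariance and sample sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two illustrative Gaussian classes in 2-D with equal priors
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigma = np.eye(2)
X1 = rng.multivariate_normal(mu[0], Sigma, 100)
X2 = rng.multivariate_normal(mu[1], Sigma, 100)

def log_gauss(x, m, S):
    d = x - m
    return -0.5 * (d @ np.linalg.solve(S, d)
                   + np.log(np.linalg.det(S)) + len(m) * np.log(2 * np.pi))

# MAP rule: assign x to the class maximizing p(x|omega_j, I) p(omega_j|I)
log_prior = np.log([0.5, 0.5])

def classify(x):
    scores = [log_gauss(x, mu[j], Sigma) + log_prior[j] for j in range(2)]
    return int(np.argmax(scores))

labels = np.array([classify(x) for x in np.vstack([X1, X2])])
accuracy = np.mean(labels == np.repeat([0, 1], 100))
print(accuracy)
```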

52 Optimality of Bayesian classifier

Bayesian classifier minimizes probability of misclassification


54 MAP classification: decision surfaces

MAP: maximize w.r.t. ωj:

ln p(ωj|x, I) = ln p(x|ωj, I) + ln p(ωj|I)

Define (M = 2):

g(x) ≡ ln p(ω1|x, I) − ln p(ω2|x, I)

For normal likelihoods:
g(x) = (1/2)(x^t Σ2⁻¹ x − x^t Σ1⁻¹ x)  [quadratic]
+ µ1^t Σ1⁻¹ x − µ2^t Σ2⁻¹ x  [linear]
− (1/2) µ1^t Σ1⁻¹ µ1 + (1/2) µ2^t Σ2⁻¹ µ2 + (1/2) ln(|Σ2|/|Σ1|) + ln[p(ω1|I)/p(ω2|I)]  [constant]

g(x) = 0 separates the classes: decision hypersurface

55 Discriminant analysis

56 QDA and LDA

Quadratic discriminant analysis (QDA); linear discriminant analysis (LDA): Σ1 = Σ2

57 ELM type classification

Pinput − 1.41ΓD2 = 7.47

A. Shabbir et al., Fusion Eng. Des. 123, 717, 2017

58 Minimum distance classifiers (1)

Linear classifier with normal likelihood (Σ1 = Σ2 ≡ Σ):

g(x) = θ^t (x − x0) = 0,
θ ≡ Σ⁻¹(µ1 − µ2)
x0 ≡ (1/2)(µ1 + µ2) − ln[p(ω1|I)/p(ω2|I)] (µ1 − µ2)/‖µ1 − µ2‖²_{Σ⁻¹}
‖µ1 − µ2‖_{Σ⁻¹} ≡ √[(µ1 − µ2)^t Σ⁻¹ (µ1 − µ2)]  (Mahalanobis distance)

p(ω1|I) = p(ω2|I) ⇒ minimum Mahalanobis distance classifier (to cluster centroid)
Σ = σ²I ⇒ minimum Euclidean distance classifier
k-nearest neighbor classification (kNN)
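A minimal sketch (means, covariance and test point hypothetical) showing that the linear form g(x) = θ^t(x − x0) and the minimum Mahalanobis distance rule give the same assignment under equal priors:

```python
import numpy as np

# Shared covariance, equal priors: minimum Mahalanobis distance classification
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
prec = np.linalg.inv(Sigma)

def mahalanobis(x, m):
    d = x - m
    return np.sqrt(d @ prec @ d)

# Equivalent linear form g(x) = theta^t (x - x0) from the slide
theta = prec @ (mu1 - mu2)
x0 = 0.5 * (mu1 + mu2)    # equal priors: boundary through the midpoint

x = np.array([0.2, 0.3])  # an arbitrary test point
assign_dist = 1 if mahalanobis(x, mu1) <= mahalanobis(x, mu2) else 2
assign_lin = 1 if theta @ (x - x0) > 0 else 2
print(assign_dist, assign_lin)
```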

59 Minimum distance classifiers (2)


61 Multilinear regression model

Regression model:

y = β0 + β1x1 + β2x2 + ... + βpxp + e e ∼ N (0, σ2), σ known

Negligible uncertainty on xj

Likelihood:

 !2 1 1 p ( | ) = √ − − − p y x, β, σ, I exp  2 y β0 ∑ βjxj  , 2πσ 2σ j=1 t t t β ≡ [β0, βp] , βp ≡ [β1,..., βp]

62 Maximum likelihood solution

Take n measurements:
y ≡ [y1,..., yn]^t
X ≡ n × (p+1) design matrix with rows [1, xi1,..., xip]
Likelihood:
p(y|X, β, σ, I) = (2π)^(−n/2) σ^(−n) exp[−(1/(2σ²)) (y − Xβ)^t (y − Xβ)]
ML: maximize w.r.t. β:
0 = ∇β (y − Xβ)^t (y − Xβ) = −2X^t y + 2X^t X β
⇒ βML = (X^t X)⁻¹ X^t y = βLS
((X^t X)⁻¹ X^t is the Moore-Penrose pseudoinverse)

63 MAP solution and posterior

Uniform priors on βj (not the most uninformative!):
p(β|y, X, σ, I) ∝ exp[−(1/(2σ²)) (y − Xβ)^t (y − Xβ)]

Due to linearity and Gaussianity: βMAP = βML = βLS
Taylor expansion (exact!):
(y − Xβ)^t (y − Xβ) = (y − XβMAP)^t (y − XβMAP) + (1/2)(β − βMAP)^t (2X^t X)(β − βMAP)
Hence,
p(β|y, X, σ, I) = (2π)^(−(p+1)/2) |Σ|^(−1/2) exp[−(1/2)(β − βMAP)^t Σ⁻¹ (β − βMAP)],
Σ ≡ σ²(X^t X)⁻¹
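A short numeric sketch of the ML = MAP = least-squares coincidence and the posterior covariance; the design, coefficients and noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 100, 2, 0.5
X = np.hstack([np.ones((n, 1)), rng.standard_normal((n, p))])  # design matrix
beta_true = np.array([1.0, 2.0, -1.0])   # illustrative coefficients
y = X @ beta_true + rng.normal(0.0, sigma, n)

# ML = MAP = least squares, via the Moore-Penrose pseudoinverse
beta_map = np.linalg.pinv(X) @ y

# Posterior covariance of beta (sigma known)
Sigma_post = sigma ** 2 * np.linalg.inv(X.T @ X)
print(beta_map)
print(np.sqrt(np.diag(Sigma_post)))   # posterior standard deviations
```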

64 Posterior predictive distribution (1)

New observations predicted by the model?

Posterior predictive distribution:
p(ynew|xnew, y, X, σ, I) = ∫_{ℝ^(p+1)} p(ynew, β|xnew, y, X, σ, I) dβ
= ∫_{ℝ^(p+1)} p(ynew|xnew, β, I) p(β|y, X, σ, I) dβ

But
p(ynew|xnew, β, I) = δ(ynew − β^t xnew)

Fix β0 = ynew − β1xnew,1 − ... − βpxnew,p

65 Posterior predictive distribution (2)

Marginalize over βp with flat priors:

p(ynew|xnew, y, X, σ, I)
∝ σ^(−n) ∫_{ℝ^p} exp{−(1/(2σ²)) ∑_{i=1}^n [yi − ynew + ∑_{j=1}^p βj(xnew,j − xij)]²} dβp

After (quite some) algebra, one finds simply

ynew,MAP = ∑_{j=1}^p xnew,j βMAP,j + βMAP,0

Simpler derivation based on properties of E and Var
General posterior more complicated!


67 Conclusions

Frequentist vs. Bayesian methods: interpretation of probability

Bayesian probability: extension of logic to situations with uncertainty

Posterior probability of parameters or hypotheses

Numerical approach in general

Underlies or explains many techniques

68 References

D.S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial, 2nd edition, Oxford University Press, 2006
W. von der Linden, V. Dose and U. von Toussaint, Bayesian Probability Theory: Applications in the Physical Sciences, Cambridge University Press, 2014
P. Gregory, Bayesian Logical Data Analysis for the Physical Sciences, Cambridge University Press, 2005
S. Theodoridis, Machine Learning: A Bayesian and Optimization Perspective, Academic Press (Elsevier), 2015
E.T. Jaynes (G.L. Bretthorst, ed.), Probability Theory: The Logic of Science, Cambridge University Press, 2003
S.B. McGrayne, The Theory That Would Not Die, Yale University Press, 2011