L7: Probability Basics

CS 344R/393R: Robotics
Benjamin Kuipers

Outline
1. Bayes’ Law
2. Probability distributions
3. Decisions under uncertainty

Probability
• For a proposition A, the probability p(A) is your degree of belief in the truth of A.
  – By convention, $0 \le p(A) \le 1$.
• This is the Bayesian view of probability.
  – It contrasts with the view that probability is the frequency that A is true, over some large population of experiments.
  – The frequentist view makes it awkward to use data to estimate the value of a constant.

Probability Theory
• p(A,B) is the joint probability of A and B.
• p(A | B) is the conditional probability of A given B.
  $p(A \mid B) + p(\neg A \mid B) = 1$
  $p(A,B) = p(B \mid A)\, p(A)$
• Bayes Law:
  $p(B \mid A) = \frac{p(A,B)}{p(A)} = \frac{p(A \mid B)\, p(B)}{p(A)}$

Bayes’ Law for Diagnosis
• Let H be a hypothesis, E be evidence.
  $p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E)}$
• p(E | H) is the likelihood of the data, given the hypothesis.
• p(H) is prior probability of hypothesis.
• p(E) is prior probability of the evidence (but acts as a normalizing constant).
• p(H | E) is what you really want to know (posterior probability of hypothesis).

Which Hypothesis To Prefer?
• Maximum Likelihood (ML)
  – $\max_H\, p(E \mid H)$
  – The model that makes the data most likely
• Maximum a posteriori (MAP)
  – $\max_H\, p(E \mid H)\, p(H)$
  – The model that is the most probable explanation
• (Story: perfect match to rare disease)

Bayes Law
• The denominator in Bayes Law acts as a normalizing constant:
  $p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E)} = \eta\, p(E \mid H)\, p(H)$
  $\eta = p(E)^{-1} = \frac{1}{\sum_H p(E \mid H)\, p(H)}$
• It ensures that the probabilities sum to 1 across all the hypotheses H.

Independence
• Two random variables are independent if
  – p(X,Y) = p(X) p(Y)
  – p(X | Y) = p(X)
  – p(Y | X) = p(Y)
  – These are all equivalent.
• X and Y are conditionally independent given Z if
  – p(X,Y | Z) = p(X | Z) p(Y | Z)
  – p(X | Y, Z) = p(X | Z)
  – p(Y | X, Z) = p(Y | Z)
• Independence simplifies inference.
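The "perfect match to rare disease" story from the Which Hypothesis To Prefer? slide can be worked numerically. The sketch below is illustrative only: the prior (0.001), the false-positive likelihood (0.01), and the names `posterior`, `priors`, and `likelihoods` are assumptions, not values from the lecture. It normalizes p(E | H) p(H) over the hypotheses, then compares the ML and MAP choices.

```python
def posterior(priors, likelihoods):
    """Return p(H | E) for each hypothesis H via Bayes' Law.

    priors[h] is p(H=h); likelihoods[h] is p(E | H=h).
    """
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(unnormalized.values())  # p(E), the normalizing constant
    return {h: value / evidence for h, value in unnormalized.items()}


# Assumed numbers: the disease is rare, and the evidence matches it perfectly,
# but healthy people occasionally show the same evidence.
priors = {"disease": 0.001, "healthy": 0.999}
likelihoods = {"disease": 1.0, "healthy": 0.01}

post = posterior(priors, likelihoods)

ml_choice = max(likelihoods, key=likelihoods.get)  # maximizes p(E | H)
map_choice = max(post, key=post.get)               # maximizes p(E | H) p(H)

print("posterior:", post)
print("ML choice:", ml_choice)
print("MAP choice:", map_choice)
```

With these assumed numbers, ML prefers the disease because it explains the evidence perfectly, while MAP prefers the healthy hypothesis because the prior is so small; the posterior for the disease is only about 9%.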
Accumulating Evidence (Naïve Bayes)
  $p(H \mid d_1, d_2, \ldots, d_n) = p(H)\, \frac{p(d_1 \mid H)}{p(d_1)}\, \frac{p(d_2 \mid H)}{p(d_2)} \cdots \frac{p(d_n \mid H)}{p(d_n)}$
  $p(H \mid d_1, d_2, \ldots, d_n) = p(H) \prod_{i=1}^{n} \frac{p(d_i \mid H)}{p(d_i)}$
  $p(H \mid d_1, d_2, \ldots, d_n) = \eta\, p(H) \prod_{i=1}^{n} p(d_i \mid H)$
  $\log p(H \mid d_1, d_2, \ldots, d_n) = \log \eta + \log p(H) + \sum_{i=1}^{n} \log p(d_i \mid H)$

Bayes Nets Represent Dependence
• The nodes are random variables.
• The links represent dependence: $p(X_i \mid \mathrm{parents}(X_i))$
  – Independence can be inferred from network
• The network represents how the joint probability distribution can be decomposed.
  $p(X_1, \ldots, X_n) = \prod_{i=1}^{n} p(X_i \mid \mathrm{parents}(X_i))$
• There are effective propagation algorithms.

Simple Bayes Net Example
[Figure: simple Bayes net example]

2. Probability distributions

Expectations
• Let x be a random variable.
• The expected value E[x] is the mean:
  $E[x] = \int x\, p(x)\, dx \approx \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
  – The probability-weighted mean of all possible values. The sample mean approaches it.
• Expected value of a vector x is by component.
  $E[\mathbf{x}] = \bar{\mathbf{x}} = [\bar{x}_1, \ldots, \bar{x}_n]^T$

Variance and Covariance
• The variance is $E[(x - E[x])^2]$
  $\sigma^2 = E[(x - \bar{x})^2] = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$
• Covariance matrix is $E[(\mathbf{x} - E[\mathbf{x}])(\mathbf{x} - E[\mathbf{x}])^T]$
  $C_{ij} = \frac{1}{N} \sum_{k=1}^{N} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)$
  – Divide by N−1 to make the sample variance an unbiased estimator for the population variance.

Biased and Unbiased Estimators
• Strictly speaking, the sample variance
  $\sigma^2 = E[(x - \bar{x})^2] = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$
  is a biased estimate of the population variance. An unbiased estimator is:
  $s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$
• But: “If the difference between N and N−1 ever matters to you, then you are probably up to no good anyway …” [Press, et al]

Covariance Matrix
• Along the diagonal, $C_{ii}$ are variances.
• Off-diagonal $C_{ij}$ are essentially correlations.
  $C = \begin{pmatrix} C_{1,1} = \sigma_1^2 & C_{1,2} & \cdots & C_{1,N} \\ C_{2,1} & C_{2,2} = \sigma_2^2 & & \vdots \\ \vdots & & \ddots & \\ C_{N,1} & \cdots & & C_{N,N} = \sigma_N^2 \end{pmatrix}$

Independent Variation
• x and y are Gaussian random variables (N=100)
• Generated with $\sigma_x = 1$, $\sigma_y = 3$
• Covariance matrix:
  $C_{xy} = \begin{pmatrix} 0.90 & 0.44 \\ 0.44 & 8.82 \end{pmatrix}$

Dependent Variation
• c and d are random variables.
• Generated with c = x + y, d = x − y
• Covariance matrix:
  $C_{cd} = \begin{pmatrix} 10.62 & -7.93 \\ -7.93 & 8.84 \end{pmatrix}$

Estimates and Uncertainty
• Conditional probability density function
[Figure: conditional probability density function]

Gaussian (Normal) Distribution
• Completely described by N(µ,σ)
  – Mean µ
  – Standard deviation σ, variance σ²
  $\frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2}$

Illustrating the Central Limit Thm
– Add 1, 2, 3, 4 variables from the same distribution.
[Figure: densities of sums of 1, 2, 3, and 4 variables]

The Central Limit Theorem
• The sum of many random variables
  – with the same mean, but
  – with arbitrary conditional density functions,
  converges to a Gaussian density function.
• If a model omits many small unmodeled effects, then the resulting error should converge to a Gaussian density function.

Detecting Modeling Error
• Every model is incomplete.
  – If the omitted factors are all small, the resulting errors should add up to a Gaussian.
• If the error between a model and the data is not Gaussian,
  – Then some omitted factor is not small.
  – One should find the dominant source of error and add it to the model.

3. Decisions under uncertainty

Sensor Noise and Sensor Interpretation
• Interpreting sensor values is like diagnosis.

                    Test=Pos          Test=Neg
  Disease present   True Positive     False Negative
  Disease absent    False Positive    True Negative

• Every test has false positives and negatives.
  – Sonar(fwd)=d implies Obstacle-at-distance(d) ??

Diagnostic Errors and Tests: Decision Thresholds
• Overlapping response to different cases:
[Figure: overlapping No and Yes response distributions split by a decision threshold into hit, miss, false alarm, and correct reject regions]

The Test Threshold Requires a Trade-Off
• You can’t eliminate all errors.
• Choose which errors are important.

ROC Curves
  $d' = \frac{\text{separation}}{\text{spread}}$
• The overlap d′ controls the trade-off between types of error.
• For more, search on Signal Detection Theory.

Bayesian Reasoning
• One strength of Bayesian methods is that they reason with probability distributions, not just the most likely individual case.
• For more, see Andrew Moore’s tutorial slides
  – http://www.autonlab.org/tutorials/
• Coming up:
  – Regression to find models from data
  – Kalman filters to track dynamical systems
  – Visual object trackers.
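The threshold trade-off from the decision-threshold and ROC slides can be sketched numerically. This is a minimal illustration under stated assumptions, not the lecture's own code: the "No" and "Yes" responses are assumed to be unit-variance Gaussians with means 0 and d′, and the names `normal_cdf` and `error_rates` are hypothetical helpers.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def error_rates(threshold, d_prime):
    """Return (false_alarm_rate, miss_rate) for a given decision threshold.

    Assumes the "No" response is N(0, 1) and the "Yes" response is N(d_prime, 1);
    the test answers "Yes" whenever the response exceeds the threshold.
    """
    false_alarm = 1.0 - normal_cdf(threshold)  # "No" case lands above threshold
    miss = normal_cdf(threshold - d_prime)     # "Yes" case lands below threshold
    return false_alarm, miss


d_prime = 2.0  # assumed separation / spread
for threshold in (0.0, 1.0, 2.0):
    fa, miss = error_rates(threshold, d_prime)
    print(f"threshold={threshold:.1f}  false alarm={fa:.3f}  miss={miss:.3f}")
```

Raising the threshold lowers the false-alarm rate and raises the miss rate, so the threshold can only trade one kind of error for the other. Under these assumptions, only a larger d′ reduces both at once.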
