L7: Probability Basics
CS 344R/393R: Robotics
Benjamin Kuipers

Outline

1. Bayes’ Law
2. Probability distributions
3. Decisions under uncertainty

Probability

• For a proposition A, the probability p(A) is your degree of belief in the truth of A.
  – By convention, 0 ≤ p(A) ≤ 1.
• This is the Bayesian view of probability.
  – It contrasts with the view that probability is the frequency that A is true, over some large population of experiments.
  – The frequentist view makes it awkward to use data to estimate the value of a constant.
• p(A, B) is the joint probability of A and B.
• p(A | B) is the conditional probability of A given B.
  – p(A | B) + p(¬A | B) = 1
  – p(A, B) = p(B | A) p(A)
• Bayes’ Law:

  p(B | A) = p(A, B) / p(A) = p(A | B) p(B) / p(A)
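
As a concrete illustration of Bayes’ Law (not from the slides), here is a minimal Python sketch that computes p(B | A) from p(A | B), p(B), and p(A); the numbers are invented for the example.

```python
# Minimal sketch of Bayes' Law: p(B | A) = p(A | B) * p(B) / p(A).
# The probabilities below are illustrative, not from the slides.

def bayes(p_A_given_B, p_B, p_A):
    """Return p(B | A) via Bayes' Law."""
    return p_A_given_B * p_B / p_A

# Example: p(A | B) = 0.8, p(B) = 0.1,
# p(A) = p(A | B) p(B) + p(A | not-B) p(not-B) = 0.8*0.1 + 0.2*0.9 = 0.26
p_A = 0.8 * 0.1 + 0.2 * 0.9
print(bayes(0.8, 0.1, p_A))   # ~0.308
```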

Bayes’ Law for Diagnosis

• Let H be a hypothesis, E be evidence.

  p(H | E) = p(E | H) p(H) / p(E)

• p(E | H) is the likelihood of the data, given the hypothesis.
• p(H) is the prior probability of the hypothesis.
• p(E) is the prior probability of the evidence (but acts as a normalizing constant).
• p(H | E) is what you really want to know (the posterior probability of the hypothesis).

Which Hypothesis To Prefer?

• Maximum Likelihood (ML)
  – max_H p(E | H)
  – The model that makes the data most likely.
• Maximum a posteriori (MAP)
  – max_H p(E | H) p(H)
  – The model that is the most probable explanation.
• (Story: perfect match to a rare disease)
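
To make the ML/MAP distinction concrete (the “rare disease” story above), here is a small Python sketch with invented likelihoods and priors: evidence that matches a rare disease perfectly still loses to a common condition once the priors are included.

```python
# Illustrative ML vs. MAP comparison; the likelihoods and priors are invented.
hypotheses = {
    # H: (p(E | H), p(H))
    "rare disease":     (0.99, 0.0001),   # evidence fits perfectly, but prior is tiny
    "common condition": (0.30, 0.2000),   # weaker fit, much larger prior
}

ml  = max(hypotheses, key=lambda h: hypotheses[h][0])                      # max_H p(E | H)
map_ = max(hypotheses, key=lambda h: hypotheses[h][0] * hypotheses[h][1])  # max_H p(E | H) p(H)

print("ML prefers: ", ml)    # rare disease
print("MAP prefers:", map_)  # common condition
```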


Bayes’ Law

• The denominator in Bayes’ Law acts as a normalizing constant:

  p(H | E) = p(E | H) p(H) / p(E) = η p(E | H) p(H)

  η = p(E)⁻¹ = 1 / Σ_H p(E | H) p(H)

• It ensures that the probabilities sum to 1 across all the hypotheses H.

Independence

• Two random variables are independent if
  – p(X, Y) = p(X) p(Y)
  – p(X | Y) = p(X)
  – p(Y | X) = p(Y)
  – These are all equivalent.
• X and Y are conditionally independent given Z if
  – p(X, Y | Z) = p(X | Z) p(Y | Z)
  – p(X | Y, Z) = p(X | Z)
  – p(Y | X, Z) = p(Y | Z)
• Independence simplifies inference.
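
A minimal sketch (assuming a discrete hypothesis set; the numbers are invented) of how the normalizer η turns the unnormalized products p(E | H) p(H) into a posterior that sums to 1:

```python
# Normalizing over a discrete set of hypotheses, as in the slide above.
# Likelihoods and priors are illustrative only.
likelihood = {"H1": 0.7, "H2": 0.2, "H3": 0.05}   # p(E | H)
prior      = {"H1": 0.1, "H2": 0.6, "H3": 0.3}    # p(H)

unnormalized = {h: likelihood[h] * prior[h] for h in prior}
eta = 1.0 / sum(unnormalized.values())            # eta = 1 / sum_H p(E|H) p(H)
posterior = {h: eta * v for h, v in unnormalized.items()}

print(posterior)                 # p(H | E) for each hypothesis
print(sum(posterior.values()))   # 1.0
```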

Accumulating Evidence (Naïve Bayes)

• If the observations d1, d2, …, dn are conditionally independent given H:

  p(H | d1, d2, …, dn) = p(H) · [p(d1 | H) / p(d1)] · [p(d2 | H) / p(d2)] ⋯ [p(dn | H) / p(dn)]

  p(H | d1, d2, …, dn) = p(H) · Π_{i=1..n} p(di | H) / p(di)

  p(H | d1, d2, …, dn) = η p(H) Π_{i=1..n} p(di | H)

  log p(H | d1, d2, …, dn) = log η + log p(H) + Σ_{i=1..n} log p(di | H)

Bayes Nets Represent Dependence

• The nodes are random variables.
• The links represent dependence: p(Xi | parents(Xi))
  – Independence can be inferred from the network structure.
• The network represents how the joint probability distribution can be decomposed:

  p(X1, …, Xn) = Π_{i=1..n} p(Xi | parents(Xi))

• There are effective propagation algorithms.
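
A minimal Python sketch (the likelihood tables and priors are invented) of accumulating evidence in log space, as in the last equation of the Naïve Bayes slide above:

```python
import math

# Invented likelihood tables p(d_i | H) for two hypotheses and three observations.
likelihoods = {
    "H":    [0.8, 0.7, 0.9],   # p(d_i | H)
    "notH": [0.3, 0.4, 0.2],   # p(d_i | not-H)
}
prior = {"H": 0.5, "notH": 0.5}

# Accumulate log p(H) + sum_i log p(d_i | H), then normalize at the end.
log_score = {h: math.log(prior[h]) + sum(math.log(p) for p in likelihoods[h])
             for h in prior}
total = sum(math.exp(s) for s in log_score.values())
posterior = {h: math.exp(s) / total for h, s in log_score.items()}
print(posterior)   # e.g. {'H': 0.95..., 'notH': 0.04...}
```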


Simple Bayes Net Example

Outline

1. Bayes’ Law

2. Probability distributions

3. Decisions under uncertainty

Expectations and Covariance

• Let x be a random variable.
• E[x] is the mean:

  E[x] = ∫ x p(x) dx  ≈  x̄ = (1/N) Σ_{i=1..N} xi

  – The probability-weighted mean of all possible values. The sample mean approaches it.
• The expected value of a vector x is taken component by component:

  E[x] = x̄ = [x̄1, …, x̄n]ᵀ

• The variance is E[(x − E[x])²]:

  σ² = E[(x − x̄)²] = (1/N) Σ_{i=1..N} (xi − x̄)²

• The covariance matrix is E[(x − E[x])(x − E[x])ᵀ]:

  Cij = (1/N) Σ_{k=1..N} (xik − x̄i)(xjk − x̄j)

  – Divide by N−1 to make the sample variance an unbiased estimator for the population variance.
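
A short numpy sketch (not from the slides) showing the sample versions of these quantities; the ddof argument selects the N vs. N−1 divisor discussed on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # 100 samples of a 2-component random vector

x_bar = X.mean(axis=0)                 # E[x], component by component
var_biased   = X.var(axis=0, ddof=0)   # divide by N   (the slide's definition)
var_unbiased = X.var(axis=0, ddof=1)   # divide by N-1 (unbiased estimator)
C = np.cov(X, rowvar=False)            # covariance matrix (numpy uses N-1 by default)

print(x_bar, var_biased, var_unbiased)
print(C)
```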

Biased and Unbiased Estimators

• Strictly speaking, the sample variance

  σ² = E[(x − x̄)²] = (1/N) Σ_{i=1..N} (xi − x̄)²

  is a biased estimate of the population variance. An unbiased estimator is:

  s² = (1/(N−1)) Σ_{i=1..N} (xi − x̄)²

• But: “If the difference between N and N−1 ever matters to you, then you are probably up to no good anyway …” [Press, et al.]

Covariance Matrix

• Along the diagonal, the Cii are the variances.
• The off-diagonal Cij are essentially correlations.

  C = [ C11 = σ1²   C12          ⋯   C1N
        C21         C22 = σ2²         ⋮
        ⋮                       ⋱
        CN1          ⋯               CNN = σN² ]
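
As a small illustration of the diagonal/off-diagonal structure (not from the slides): covariances can be rescaled into correlations by dividing by the standard deviations, which numpy’s corrcoef does directly.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)     # y depends partly on x

C = np.cov(x, y)                       # 2x2 covariance matrix
sigma = np.sqrt(np.diag(C))            # standard deviations from the diagonal
corr = C / np.outer(sigma, sigma)      # off-diagonal entries become correlations

print(C)
print(corr)                            # matches np.corrcoef(x, y)
```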

Independent Variation

• x and y are Gaussian random variables (N = 100).
• Generated with σx = 1, σy = 3.
• Covariance matrix:

  Cxy = [ 0.90  0.44
          0.44  8.82 ]

Dependent Variation

• c and d are random variables.
• Generated with c = x + y, d = x − y.
• Covariance matrix:

  Ccd = [ 10.62  −7.93
          −7.93   8.84 ]
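
A minimal sketch reproducing this kind of experiment (the exact numbers will differ from the slide’s sample values):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = rng.normal(scale=1.0, size=N)    # sigma_x = 1
y = rng.normal(scale=3.0, size=N)    # sigma_y = 3

c = x + y
d = x - y

print(np.cov(x, y))   # roughly diag(1, 9): x and y vary independently
print(np.cov(c, d))   # strong negative off-diagonal: c and d co-vary
```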


Estimates and Uncertainty

• Conditional probability density function

Gaussian (Normal) Distribution

• Completely described by N(µ, σ)
  – Mean µ
  – Standard deviation σ, variance σ²

  p(x) = (1 / (σ √(2π))) e^(−(x − µ)² / 2σ²)
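
A minimal Python version of this density (a sketch, not from the slides):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """N(mu, sigma) density: (1 / (sigma * sqrt(2*pi))) * exp(-(x - mu)**2 / (2 * sigma**2))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(gaussian_pdf(0.0))            # ~0.3989, the peak of the standard normal
print(gaussian_pdf(1.0, 0.0, 1.0))  # ~0.2420
```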

The Central Limit Theorem

• The sum of many random variables
  – with the same mean, but
  – with arbitrary conditional density functions,
  converges to a Gaussian density function.
• If a model omits many small unmodeled effects, then the resulting error should converge to a Gaussian density function.

Illustrating the Central Limit Thm

– Add 1, 2, 3, 4 variables from the same distribution.
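
A small numpy sketch (not from the slides) of the “add 1, 2, 3, 4 variables” illustration above: sums of uniform random variables quickly start to look Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.uniform(-1.0, 1.0, size=(100_000, 4))  # 4 i.i.d. uniform variables per row

for k in (1, 2, 3, 4):
    s = samples[:, :k].sum(axis=1)                    # sum of the first k variables
    # Print a coarse histogram: flat for k=1, visibly bell-shaped by k=3-4.
    hist, _ = np.histogram(s, bins=9, range=(-4, 4))
    print(k, (hist / hist.max() * 40).astype(int))
```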

Detecting Modeling Error

• Every model is incomplete.
  – If the omitted factors are all small, the resulting errors should add up to a Gaussian.
• If the error between a model and the data is not Gaussian,
  – then some omitted factor is not small.
  – One should find the dominant source of error and add it to the model.

Outline

1. Bayes’ Law
2. Probability distributions
3. Decisions under uncertainty
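
Following up on the Detecting Modeling Error slide above: one simple way to act on that advice (a sketch assuming scipy is available; the model and data are invented) is to test the residuals for normality.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
t = np.linspace(0, 10, 200)

# Invented "data": a linear trend plus Gaussian noise plus an omitted sinusoidal effect.
data = 2.0 * t + rng.normal(scale=0.5, size=t.size) + 1.5 * np.sin(3 * t)

# Fit the (incomplete) linear model and examine the residuals.
slope, intercept = np.polyfit(t, data, 1)
residuals = data - (slope * t + intercept)

stat, p = stats.normaltest(residuals)
print(p)   # a small p-value suggests the residuals are not Gaussian:
           # some omitted factor (here the sinusoid) is not small
```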

Sensor Noise and Sensor Interpretation

• Interpreting sensor values is like diagnosis.
  – Sonar(fwd) = d implies Obstacle-at-distance(d) ??
• Overlapping response to different cases.

Diagnostic Errors and Tests: Decision Thresholds

                     Test = Pos          Test = Neg
  Disease present    True Positive       False Negative
                     (hit)               (miss)
  Disease absent     False Positive      True Negative
                     (false alarm)       (correct reject)

• Every test has false positives and negatives.
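
A minimal sketch (invented sonar readings and threshold) of how a decision threshold on a noisy sensor produces exactly this 2×2 table of outcomes:

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented example: sonar intensity readings with and without an obstacle present.
obstacle    = rng.normal(loc=2.0, scale=1.0, size=1000)   # "disease present" cases
no_obstacle = rng.normal(loc=0.0, scale=1.0, size=1000)   # "disease absent" cases

threshold = 1.0                                       # declare "obstacle" when reading > threshold
hits            = np.sum(obstacle    > threshold)     # true positives
misses          = np.sum(obstacle    <= threshold)    # false negatives
false_alarms    = np.sum(no_obstacle > threshold)     # false positives
correct_rejects = np.sum(no_obstacle <= threshold)    # true negatives

print(hits, misses, false_alarms, correct_rejects)
```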

The Test Threshold Requires a Trade-Off

• You can’t eliminate all errors.
• Choose which errors are important.

ROC Curves

• d′ = separation / spread
• The overlap d′ controls the trade-off between types of error.
• For more, search on Signal Detection Theory.
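
A short sketch (continuing the invented sonar example above) that sweeps the threshold to trace the hit-rate / false-alarm-rate trade-off and computes d′ = separation / spread:

```python
import numpy as np

rng = np.random.default_rng(5)
obstacle    = rng.normal(loc=2.0, scale=1.0, size=1000)
no_obstacle = rng.normal(loc=0.0, scale=1.0, size=1000)

# d' = separation between the two response distributions, in units of their spread.
d_prime = (obstacle.mean() - no_obstacle.mean()) / no_obstacle.std()
print("d' ~", d_prime)                         # ~2 for these distributions

# Sweep the threshold: each choice trades misses against false alarms (an ROC curve).
for threshold in (0.0, 0.5, 1.0, 1.5, 2.0):
    hit_rate         = np.mean(obstacle    > threshold)
    false_alarm_rate = np.mean(no_obstacle > threshold)
    print(threshold, round(hit_rate, 2), round(false_alarm_rate, 2))
```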

Bayesian Reasoning

• One strength of Bayesian methods is that they reason with probability distributions, not just the most likely individual case.

• For more, see Andrew Moore’s tutorial slides – http://www.autonlab.org/tutorials/

• Coming up: – Regression to find models from data – Kalman filters to track dynamical systems – Visual object trackers.
