
CS 287: Advanced Robotics (Fall 2009)
Lecture 21: HMMs, Bayes filter, smoother, Kalman filters
Pieter Abbeel, UC Berkeley EECS

Overview

- Thus far: optimal control and reinforcement learning. We always assumed we got to observe the state at each time, and the challenge was to choose a good action.
- Current and next set of lectures:
  - The state is not observed.
  - Instead, we get some sensory information about the state.
  - Challenge: compute a distribution over the state which accounts for the sensory information ("evidence") we have observed.

Examples

- Helicopter
  - A choice of state: position, orientation, velocity, angular rate
  - Sensors:
    - GPS: noisy estimate of position (sometimes also velocity)
    - Inertial sensing unit: noisy measurements from (i) 3-axis gyro [= angular rate sensor], (ii) 3-axis accelerometer [= measures acceleration + gravity; e.g., measures (0,0,0) in free fall], (iii) 3-axis magnetometer
- Mobile robot inside building
  - A choice of state: position and heading
  - Sensors:
    - Odometry (= sensing motion of actuators): e.g., wheel encoders
    - Laser range finder: measures time of flight of a laser beam between departure and return (the return typically happens when the beam hits a surface that reflects it back to where it came from)

Probability review

For any random variables X, Y we have:

- Definition of conditional probability:
    P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)
- Chain rule: (follows directly from the above)
    P(X=x, Y=y) = P(X=x) P(Y=y | X=x)
                = P(Y=y) P(X=x | Y=y)
- Bayes rule: (really just a re-ordering of terms in the above)
    P(X=x | Y=y) = P(Y=y | X=x) P(X=x) / P(Y=y)
- Marginalization:
    P(X=x) = ∑y P(X=x, Y=y)

Note: no assumptions beyond X, Y being random variables are made for any of these to hold true (and when we divide by something, that something is not zero).
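These identities are easy to sanity-check numerically. Below is a minimal Python sketch (our own illustration, not part of the original slides) that stores a joint distribution over two binary variables as a table and reads off the conditional, Bayes rule, and marginalization:

```python
import numpy as np

# Joint distribution P(X, Y) over two binary variables as a 2x2 table
# (rows: x in {0,1}, columns: y in {0,1}). Values are made up for illustration.
P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

P_x = P_xy.sum(axis=1)            # marginalization: P(X=x) = sum_y P(X=x, Y=y)
P_y = P_xy.sum(axis=0)            # P(Y=y) = sum_x P(X=x, Y=y)

P_x_given_y = P_xy / P_y          # conditional: P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)
P_y_given_x = (P_xy.T / P_x).T    # P(Y=y | X=x)

# Bayes rule: P(X=x | Y=y) = P(Y=y | X=x) P(X=x) / P(Y=y)
bayes = (P_y_given_x * P_x[:, None]) / P_y
assert np.allclose(bayes, P_x_given_y)
```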

Probability review (continued)

For any random variables X, Y, Z, W we have:

- Conditional probability: (can condition on a third variable z throughout)
    P(X=x | Y=y, Z=z) = P(X=x, Y=y | Z=z) / P(Y=y | Z=z)
- Chain rule:
    P(X=x, Y=y, Z=z, W=w) = P(X=x) P(Y=y | X=x) P(Z=z | X=x, Y=y) P(W=w | X=x, Y=y, Z=z)
- Bayes rule: (can condition on another variable z throughout)
    P(X=x | Y=y, Z=z) = P(Y=y | X=x, Z=z) P(X=x | Z=z) / P(Y=y | Z=z)
- Marginalization:
    P(X=x | W=w) = ∑y,z P(X=x, Y=y, Z=z | W=w)

Note: no assumptions beyond X, Y, Z, W being random variables are made for any of these to hold true (and when we divide by something, that something is not zero).

Independence

- Two random variables X and Y are independent iff
    for all x, y: P(X=x, Y=y) = P(X=x) P(Y=y)
- Representing a distribution over a set of random variables X1, X2, ..., XT in its most general form can be expensive.
  - E.g., if all Xi are binary valued, then there would be a total of 2^T possible instantiations, and it would require 2^T − 1 numbers to represent the probability distribution.
- However, if we assumed the random variables were independent, then we could very compactly represent the joint distribution as follows:
    P(X1=x1, X2=x2, ..., XT=xT) = P(X1=x1) P(X2=x2) ... P(XT=xT)
  - Thanks to the independence assumptions, for the binary case we went from requiring 2^T − 1 parameters to only requiring T parameters!
- Unfortunately, independence is often too strong an assumption ...

Conditional independence

- Two random variables X and Y are conditionally independent given a third variable Z iff
    for all x, y, z: P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z)
- Chain rule (which holds true for all distributions, no assumptions needed):
    P(X=x, Y=y, Z=z, W=w) = P(X=x) P(Y=y | X=x) P(Z=z | X=x, Y=y) P(W=w | X=x, Y=y, Z=z)
  - For binary variables the representation requires 1 + 2 + 4 + 8 = 2^4 − 1 numbers (just like a full joint probability table)
- Now assume Z independent of X given Y, and assume W independent of X and Y given Z; then we obtain:
    P(X=x, Y=y, Z=z, W=w) = P(X=x) P(Y=y | X=x) P(Z=z | Y=y) P(W=w | Z=z)
  - For binary variables the representation requires 1 + 2 + 2 + 2 = 1 + (4−1)·2 numbers --- significantly less!!

Markov Models

- Models a distribution over a set of random variables X1, X2, ..., XT where the index is typically associated with some notion of time.
- Markov models make the assumption:
  - Xt is independent of X1, ..., Xt-2 when given Xt-1
- Chain rule: (always holds true, not just in Markov models!)
    P(X1=x1, X2=x2, ..., XT=xT) = ∏t P(Xt=xt | Xt-1=xt-1, Xt-2=xt-2, ..., X1=x1)
- Now apply the Markov conditional independence assumption:
    P(X1=x1, X2=x2, ..., XT=xT) = ∏t P(Xt=xt | Xt-1=xt-1)     (1)
  - In the binary case: 1 + 2·(T−1) numbers are required to represent the joint distribution over all variables (vs. 2^T − 1); see the code sketch below
- Graphical representation: a variable Xt receives an arrow from the variables appearing in its conditional probability in the expression for the joint distribution (1) [called a graphical model or Bayes net representation]

    X1 → X2 → X3 → X4 → ... → XT
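To make the parameter-count argument concrete, here is a minimal Python sketch (our own illustration; the numbers are hypothetical) that evaluates the factored joint (1) for a binary chain with a shared transition table:

```python
import numpy as np

# Hypothetical numbers for illustration.
p_x1 = np.array([0.6, 0.4])          # P(X1): 1 free parameter in the binary case
T = np.array([[0.7, 0.3],            # T[i, j] = P(X_t = j | X_{t-1} = i)
              [0.2, 0.8]])           # 2 free parameters (one per row)

def joint_prob(xs):
    """P(X1=x1, ..., XT=xT) = P(X1=x1) * prod_t P(Xt=xt | Xt-1=xt-1)."""
    p = p_x1[xs[0]]
    for prev, cur in zip(xs[:-1], xs[1:]):
        p *= T[prev, cur]
    return p

print(joint_prob([0, 0, 1, 1]))      # e.g., 0.6 * 0.7 * 0.3 * 0.8
```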

Hidden Markov Models

- Underlying Markov model over states Xt
- For each state Xt there is a random variable Zt which is a sensory measurement of Xt
- This gives the following graphical (Bayes net) representation:

    X1 → X2 → X3 → X4 → ... → XT
    ↓    ↓    ↓    ↓          ↓
    Z1   Z2   Z3   Z4         ZT

Hidden Markov Models: assumptions

- Assumption 1: Xt independent of X1, ..., Xt-2 given Xt-1
- Assumption 2: Zt is conditionally independent of the other variables given Xt
- Factoring the joint P(X1=x1, Z1=z1, X2=x2, Z2=z2, ..., XT=xT, ZT=zT) term by term:

    Chain rule: (no assumptions)                              HMM assumptions:
    P(X1=x1)                                                  P(X1=x1)
    P(Z1=z1 | X1=x1)                                          P(Z1=z1 | X1=x1)
    P(X2=x2 | X1=x1, Z1=z1)                                   P(X2=x2 | X1=x1)
    P(Z2=z2 | X1=x1, Z1=z1, X2=x2)                            P(Z2=z2 | X2=x2)
    ...                                                       ...
    P(XT=xT | X1=x1, Z1=z1, ..., XT-1=xT-1, ZT-1=zT-1)        P(XT=xT | XT-1=xT-1)
    P(ZT=zT | X1=x1, Z1=z1, ..., XT-1=xT-1, ZT-1=zT-1, XT=xT) P(ZT=zT | XT=xT)

Mini quiz

- What would the graph look like for a Bayesian network with no conditional independence assumptions?
- Our particular choice of ordering of variables in the chain rule enabled us to easily incorporate the HMM assumptions. What if we had chosen a different ordering in the chain rule expansion?

Example

[Figure: a concrete example HMM was shown here.]

- The HMM is defined by:
  - Initial distribution: P(X1 = x)
  - Transitions: P(Xt = x | Xt-1 = x')
  - Observations: P(Zt = z | Xt = x)
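In code, these three ingredients fully specify a discrete HMM. The sketch below (our own container class, not from the slides) bundles them; the filtering examples later in these notes assume it:

```python
import numpy as np

class DiscreteHMM:
    """Container for a discrete HMM: initial, transition, observation models."""
    def __init__(self, p_init, p_trans, p_obs):
        self.p_init = np.asarray(p_init)     # p_init[x]      = P(X1 = x)
        self.p_trans = np.asarray(p_trans)   # p_trans[x', x] = P(Xt = x | Xt-1 = x')
        self.p_obs = np.asarray(p_obs)       # p_obs[x, z]    = P(Zt = z | Xt = x)
```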

Real HMM Examples

- Robot localization:
  - Observations are range readings (continuous)
  - States are positions on a map (continuous)
- Speech recognition HMMs:
  - Observations are acoustic signals (continuous valued)
  - States are specific positions in specific words (so, tens of thousands)
- Machine translation HMMs:
  - Observations are words (tens of thousands)
  - States are translation options

Filtering / Monitoring

- Filtering, or monitoring, is the task of tracking the distribution
    P(Xt | Z1=z1, Z2=z2, ..., Zt=zt)
  over time. This distribution is called the belief state.
- We start with P(X0) in an initial setting, usually uniform.
- As time passes, or we get observations, we update the belief state.
- The Kalman filter was invented in the 60's and first implemented as a method of trajectory estimation for the Apollo program. [See the course website for a historical account of the Kalman filter: "From Gauss to Kalman".]

Example: Robot Localization

Example from Michael Pfeiffer.

[Figures: grid-world belief maps (color scale: Prob 0 to 1) at t = 0 through t = 5, showing the belief sharpening as the robot moves and senses.]

- Sensor model: never more than 1 mistake
- Motion model: may not execute action with small prob.
- The robot knows its heading (North, East, South or West)

Inference: Base Cases

- Incorporate an observation (Bayes net fragment X1 → Z1):
    P(x1 | z1) = P(z1 | x1) P(x1) / P(z1)
- Time update (Bayes net fragment X1 → X2):
    P(x2) = ∑x1 P(x2 | x1) P(x1)

Time update

- Assume we have the current belief P(X | evidence to date).
- Then, after one time step passes (Xt → Xt+1):
    P(xt+1 | z1, ..., zt) = ∑xt P(xt+1 | xt) P(xt | z1, ..., zt)

Observation update

- Assume we have: P(xt+1 | z1, ..., zt)
- Then, upon receiving the evidence Zt+1 = zt+1 (Bayes net fragment Xt+1 → Zt+1):
    P(xt+1 | z1, ..., zt+1) = P(zt+1 | xt+1) P(xt+1 | z1, ..., zt) / P(zt+1 | z1, ..., zt)

Algorithm

- Init P(x0) [e.g., uniformly]
- Observation update for time 0
- For t = 1, 2, ...
  - Time update
  - Observation update
- For continuous state / observation spaces: simply replace summation by integral
- (A concrete implementation is sketched below.)
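For a finite state space, the loop above is a few lines of numpy. This is a minimal sketch assuming the hypothetical DiscreteHMM container from earlier; the convention of starting with an observation update at the first time step is ours:

```python
def bayes_filter(hmm, observations):
    """Track the belief state P(x_t | z_1, ..., z_t) over time."""
    belief = hmm.p_obs[:, observations[0]] * hmm.p_init   # observation update
    belief /= belief.sum()
    beliefs = [belief]
    for z in observations[1:]:
        predicted = hmm.p_trans.T @ belief                # time update
        belief = hmm.p_obs[:, z] * predicted              # observation update
        belief = belief / belief.sum()                    # normalize
        beliefs.append(belief)
    return beliefs
```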

Example HMM

[Figure: a two-state HMM; the tables below are the classic rain/umbrella model, with state Rt (rain) and observation Ut (umbrella).]

    Rt-1   P(Rt = T)          Rt   P(Ut = T)
    T      0.7                T    0.9
    F      0.3                F    0.2

The Forward Algorithm

- Time/dynamics update and observation update in one:
    P(xt, z1, ..., zt) = P(zt | xt) ∑xt-1 P(xt | xt-1) P(xt-1, z1, ..., zt-1)
  → recursive update
- Normalization:
  - Can be helpful for numerical reasons
  - However: lose information!
  - Can renormalize (for numerical reasons) + keep track of the normalization factor (to enable recovering all information)
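A sketch of the forward recursion on the umbrella model above, keeping the messages unnormalized so no information is lost (uses the hypothetical DiscreteHMM class from earlier; the uniform initial distribution is an assumption):

```python
# State 0 = rain, 1 = no rain; observation 0 = umbrella, 1 = no umbrella.
rain_hmm = DiscreteHMM(
    p_init=[0.5, 0.5],          # assumed uniform initial distribution
    p_trans=[[0.7, 0.3],        # P(R_t = T | R_{t-1} = T) = 0.7
             [0.3, 0.7]],       # P(R_t = T | R_{t-1} = F) = 0.3
    p_obs=[[0.9, 0.1],          # P(U_t = T | R_t = T) = 0.9
           [0.2, 0.8]],         # P(U_t = T | R_t = F) = 0.2
)

def forward(hmm, observations):
    """Unnormalized forward messages a_t(x) = P(x_t, z_1, ..., z_t)."""
    a = hmm.p_obs[:, observations[0]] * hmm.p_init
    messages = [a]
    for z in observations[1:]:
        a = hmm.p_obs[:, z] * (hmm.p_trans.T @ a)  # dynamics + observation in one
        messages.append(a)
    return messages

msgs = forward(rain_hmm, [0, 0, 1])    # umbrella, umbrella, no umbrella
belief = msgs[-1] / msgs[-1].sum()     # renormalize to get P(x_3 | z_1, z_2, z_3)
```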

The likelihood of the observations

- The forward algorithm first sums over x1, then over x2, and so forth, which allows it to efficiently compute the likelihood at all times t; indeed:
    P(z1, ..., zt) = ∑xt P(xt, z1, ..., zt)
- Relevance:
  - Compare the fit of several HMM models to the data
  - Could optimize the dynamics model and observation model to maximize the likelihood
  - Run multiple simultaneous trackers --- retain the best and split again whenever applicable (e.g., loop closures in SLAM, or different flight maneuvers)
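Continuing the earlier sketch (numpy already imported there): the likelihood is just the sum of the last unnormalized forward message, so comparing model fits is one line per model.

```python
def log_likelihood(hmm, observations):
    """log P(z_1, ..., z_T) = log sum_x P(x_T, z_1, ..., z_T)."""
    return float(np.log(forward(hmm, observations)[-1].sum()))

# E.g., evaluate candidate models on the same data: the higher
# log-likelihood indicates the better-fitting dynamics/observation models.
print(log_likelihood(rain_hmm, [0, 0, 1]))
```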

With control inputs

         U1   U2   U3   U4
          ↘    ↘    ↘    ↘
    X1 → X2 → X3 → X4 → X5
    ↓    ↓    ↓    ↓    ↓
    Z1   Z2   Z3   Z4   Z5

- Control inputs known:
  - They can simply be seen as selecting a particular dynamics function
- Control inputs unknown:
  - Assume a distribution over them
- The drawing above assumes open-loop controls. This is rarely the case in practice. [The Markov assumption is rarely the case either. Both assumptions seem to have given quite satisfactory results.]

Smoothing

- Thus far: filtering, which finds
  - The distribution over states at time t given all evidence until time t: P(xt | z1, ..., zt)
  - The likelihood of the evidence up to time t: P(z1, ..., zt)
- How about P(xt | z1, ..., zT) for T ≠ t?
  - T < t: can simply run the forward algorithm until time t, but stop incorporating evidence from time T+1 onwards
  - T > t: need something else

Smoothing (continued)

- Goal:
    P(xt, z1, ..., zT) = ∑x1, ..., xt-1, xt+1, ..., xT P(x1, z1, x2, z2, ..., xT, zT)
- The sum as written has a number of terms exponential in T.
- Key idea: the order in which variables are being summed out affects computational complexity.
  - The forward algorithm exploits summing out x1, x2, ..., xt-1 in this order
  - Can similarly run a backward algorithm, which sums out xT, xT-1, ..., xt+1 in this order
- The sum factors into two groups of terms:
    P(xt, z1, ..., zT) = (∑x1, ..., xt-1 [factors involving x1, ..., xt]) × (∑xt+1, ..., xT [factors involving xt, ..., xT])
  The forward algorithm computes the first factor, P(xt, z1, ..., zt); the backward algorithm computes the second.
- Can be easily verified from the equations: the factors in the right parentheses only contain Xt+1, ..., XT; hence they act as a constant when summing out over X1, ..., Xt-1 and can be brought outside the summation.
- Can also be read off from the Bayes net graph / conditional independence assumptions:
  - X1, ..., Xt-1 are conditionally independent of Xt+1, ..., XT given Xt

Backward algorithm

- Sum out xT first:
    fT-1(xT-1) = ∑xT P(xT | xT-1) P(eT | xT)
- Can recursively compute, for l = T, T−1, ...:
    fl-1(xl-1) = ∑xl P(xl | xl-1) P(el | xl) fl(xl)
  (with base case fT(xT) = 1; here el denotes the evidence zl)

Smoother algorithm

- Run the forward algorithm, which gives
    P(xt, z1, ..., zt) for all t
- Run the backward algorithm, which gives
    ft(xt) for all t
- Find
    P(xt, z1, ..., zT) = P(xt, z1, ..., zt) ft(xt)
- If desirable, can renormalize and find P(xt | z1, ..., zT)
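A sketch of the backward recursion and the resulting smoother, continuing the earlier forward() example (our own code; the base case f_T(x_T) = 1 is the standard convention):

```python
def backward(hmm, observations):
    """Messages f_t(x) = P(z_{t+1}, ..., z_T | x_t), computed right to left."""
    f = np.ones(len(hmm.p_init))          # base case: f_T(x_T) = 1
    messages = [f]
    for z in reversed(observations[1:]):  # sum out x_T, then x_{T-1}, ...
        f = hmm.p_trans @ (hmm.p_obs[:, z] * f)
        messages.append(f)
    return list(reversed(messages))       # messages[t] pairs with forward msg t

def smooth(hmm, observations):
    """P(x_t | z_1, ..., z_T) for every t, via forward * backward."""
    fwd, bwd = forward(hmm, observations), backward(hmm, observations)
    return [a * f / (a * f).sum() for a, f in zip(fwd, bwd)]
```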

Bayes filters

- Recursively compute:
    P(xt, z1:t-1) = ∑xt-1 P(xt | xt-1) P(xt-1, z1:t-1)
    P(xt, z1:t) = P(xt, z1:t-1) P(zt | xt)
- Tractable cases:
  - State space finite and sufficiently small (what we have in some sense considered so far)
  - Systems with linear dynamics and linear observations and Gaussian noise → Kalman filtering

Univariate Gaussian

- Gaussian distribution with mean µ and standard deviation σ:
    p(x) = 1/(σ √(2π)) · exp(−(x − µ)² / (2σ²))

[Figure: the bell-shaped density p(x) plotted against x.]

Properties of Gaussians

- Mean:
    E[X] = ∫ x p(x) dx = µ
- Variance:
    E[(X − µ)²] = ∫ (x − µ)² p(x) dx = σ²

Central limit theorem (CLT)

- Classical CLT:
  - Let X1, X2, ... be an infinite sequence of independent random variables with E[Xi] = µ, E[(Xi − µ)²] = σ²
  - Define Zn = ((X1 + ... + Xn) − nµ) / (σ n^{1/2})
  - Then, in the limit of n going to infinity, Zn is distributed according to N(0, 1)
- Crude statement: things that are the result of the addition of lots of small effects tend to become Gaussian.
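A quick numerical illustration of the crude statement (our own sketch): standardized sums of uniform random variables already look like N(0, 1) for moderate n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 30, 100_000
mu, sigma = 0.5, np.sqrt(1 / 12)          # mean/std of Uniform(0, 1)

X = rng.random((trials, n))               # X_1, ..., X_n per trial
Z = (X.sum(axis=1) - n * mu) / (sigma * np.sqrt(n))

print(Z.mean(), Z.std())                  # close to 0 and 1, as N(0, 1) predicts
```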

Multi-variate Gaussians

- Density with mean vector µ and covariance matrix Σ (for x ∈ R^d):
    N(x; µ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp(−½ (x − µ)^T Σ^{−1} (x − µ))

Multi-variate Gaussians: examples

[Figures: contour plots of N(x; µ, Σ) for the parameter settings below.]

- Shifting the mean translates the density: µ = [1; 0], µ = [-.5; 0], µ = [-1; -1.5], each with Σ = [1 0; 0 1]
- Scaling a diagonal covariance changes the spread: µ = [0; 0] with Σ = [1 0; 0 1], Σ = [.6 0; 0 .6], Σ = [2 0; 0 2]
- Off-diagonal entries introduce correlation: µ = [0; 0] with Σ = [1 0; 0 1], Σ = [1 0.5; 0.5 1], Σ = [1 0.8; 0.8 1]
- Negative correlation and anisotropic spread: µ = [0; 0] with Σ = [1 -0.5; -0.5 1], Σ = [1 -0.8; -0.8 1], Σ = [3 0.8; 0.8 1]

Discrete Kalman Filter

- Estimates the state x of a discrete-time controlled process that is governed by the linear stochastic difference equation
    xt = At xt-1 + Bt ut + εt
  with a measurement
    zt = Ct xt + δt

Components of a Kalman Filter

- At: matrix (n×n) that describes how the state evolves from t−1 to t without controls or noise.
- Bt: matrix (n×l) that describes how the control ut changes the state from t−1 to t.
- Ct: matrix (k×n) that describes how to map the state xt to an observation zt.
- εt, δt: random variables representing the process and measurement noise; they are assumed to be independent and normally distributed with covariance Rt and Qt, respectively.
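To make the roles of these matrices concrete, here is a hypothetical 1-D constant-velocity model in numpy (state = [position; velocity], control = acceleration, observation = position; all numbers illustrative):

```python
import numpy as np

dt = 0.1
A = np.array([[1.0, dt],     # position += velocity * dt
              [0.0, 1.0]])   # velocity unchanged by dynamics alone
B = np.array([[0.5 * dt**2], # control u (acceleration) enters via kinematics
              [dt]])
C = np.array([[1.0, 0.0]])   # we only measure position
R = 0.01 * np.eye(2)         # process noise covariance (of epsilon_t)
Q = np.array([[0.25]])       # measurement noise covariance (of delta_t)
```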

Linear Gaussian Systems: Initialization

- Initial belief is normally distributed:
    bel(x0) = N(x0; µ0, Σ0)

Linear Gaussian Systems: Dynamics

- Dynamics are a linear function of state and control plus additive noise:
    xt = At xt-1 + Bt ut + εt
    p(xt | ut, xt-1) = N(xt; At xt-1 + Bt ut, Rt)
- Time update of the belief (bel̄ denotes the predicted belief):
    bel̄(xt) = ∫ p(xt | ut, xt-1) bel(xt-1) dxt-1
  where p(xt | ut, xt-1) ~ N(xt; At xt-1 + Bt ut, Rt) and bel(xt-1) ~ N(xt-1; µt-1, Σt-1)

Linear Gaussian Systems: Dynamics (result)

- Writing out the integrand:
    bel̄(xt) = η ∫ exp(−½ (xt − At xt-1 − Bt ut)^T Rt^{-1} (xt − At xt-1 − Bt ut))
                 · exp(−½ (xt-1 − µt-1)^T Σt-1^{-1} (xt-1 − µt-1)) dxt-1
- The result is again a Gaussian:
    bel̄(xt):  µ̄t = At µt-1 + Bt ut
               Σ̄t = At Σt-1 At^T + Rt

Proof: completion of squares

- To integrate out xt-1, re-write the integrand in the format:
    f(µt-1, Σt-1, xt) · (2π)^{−d/2} |S|^{−1/2} exp(−½ (xt-1 − m)^T S^{-1} (xt-1 − m))
- This integral is readily computed (the integral of a multivariate Gaussian times a constant = that constant) to be f(µt-1, Σt-1, xt).
- Inspection of f will show that it is a multi-variate Gaussian in xt with the mean and covariance as shown above.
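In code, the resulting time update is two lines (a sketch, reusing the hypothetical matrices defined above):

```python
def kf_predict(mu, Sigma, A, B, R, u):
    """Time update: push the Gaussian belief through the linear dynamics."""
    mu_bar = A @ mu + B @ u             # mean:       A mu + B u
    Sigma_bar = A @ Sigma @ A.T + R     # covariance: A Sigma A^T + R
    return mu_bar, Sigma_bar
```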

Properties of Gaussians

- We just showed:
    X ~ N(µ, Σ), Y = AX + B  ⇒  Y ~ N(Aµ + B, A Σ A^T)
- We stay in the "Gaussian world" as long as we start with Gaussians and perform only linear transformations.
- Now that we know this, we could have found µY and ΣY without computing integrals, by directly computing the expected values:
    E[Y] = E[AX + B] = A E[X] + B = Aµ + B
    ΣYY = E[(Y − E[Y])(Y − E[Y])^T] = E[(AX + B − Aµ − B)(AX + B − Aµ − B)^T]
        = E[A(X − µ)(X − µ)^T A^T] = A E[(X − µ)(X − µ)^T] A^T = A Σ A^T

Self-quiz

[The self-quiz slide's content was an image and is not recoverable from the extraction.]
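A quick numerical check of this property (our own sketch): sample from N(µ, Σ), transform, and compare the empirical mean/covariance against Aµ + B and AΣA^T.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([1.0, 0.0])
Sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[2.0, 0.0],
              [1.0, 1.0]])
B = np.array([0.5, -1.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + B                            # Y_i = A x_i + B for each sample

print(Y.mean(axis=0), A @ mu + B)          # empirical vs. A mu + B
print(np.cov(Y.T), A @ Sigma @ A.T)        # empirical vs. A Sigma A^T
```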

Linear Gaussian Systems: Observations

- Observations are a linear function of state plus additive noise:
    zt = Ct xt + δt
    p(zt | xt) = N(zt; Ct xt, Qt)
- Observation update of the belief:
    bel(xt) = η p(zt | xt) bel̄(xt)
  where p(zt | xt) ~ N(zt; Ct xt, Qt) and bel̄(xt) ~ N(xt; µ̄t, Σ̄t)
- Writing out the product:
    bel(xt) = η exp(−½ (zt − Ct xt)^T Qt^{-1} (zt − Ct xt)) · exp(−½ (xt − µ̄t)^T Σ̄t^{-1} (xt − µ̄t))
- The result is again a Gaussian:
    bel(xt):  µt = µ̄t + Kt (zt − Ct µ̄t)    with Kt = Σ̄t Ct^T (Ct Σ̄t Ct^T + Qt)^{-1}
              Σt = (I − Kt Ct) Σ̄t
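The corresponding observation update as a sketch (same hypothetical matrix names as before; numpy imported earlier):

```python
def kf_correct(mu_bar, Sigma_bar, C, Q, z):
    """Observation update: condition the predicted Gaussian on measurement z."""
    S = C @ Sigma_bar @ C.T + Q                       # innovation covariance
    K = Sigma_bar @ C.T @ np.linalg.inv(S)            # Kalman gain
    mu = mu_bar + K @ (z - C @ mu_bar)                # mean update
    Sigma = (np.eye(len(mu_bar)) - K @ C) @ Sigma_bar # covariance update
    return mu, Sigma
```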

Proof: completion of squares

- Re-write the expression for bel(xt) in the format:
    f(µ̄t, Σ̄t, Ct, Qt) · (2π)^{−d/2} |S|^{−1/2} exp(−½ (xt − m)^T S^{-1} (xt − m))
- f is the normalization factor.
- The expression in parentheses is a multi-variate Gaussian in xt. Its parameters m and S can be identified to satisfy the expressions for the mean and covariance above.

Kalman Filter Algorithm

Algorithm Kalman_filter(µt-1, Σt-1, ut, zt):

  Prediction:
    µ̄t = At µt-1 + Bt ut
    Σ̄t = At Σt-1 At^T + Rt

  Correction:
    Kt = Σ̄t Ct^T (Ct Σ̄t Ct^T + Qt)^{-1}
    µt = µ̄t + Kt (zt − Ct µ̄t)
    Σt = (I − Kt Ct) Σ̄t

  Return µt, Σt
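Putting the prediction and correction sketches together gives the full algorithm; here it runs one step on the hypothetical constant-velocity model from earlier (illustrative numbers):

```python
def kalman_filter(mu, Sigma, u, z, A, B, C, R, Q):
    """One full Kalman filter step: prediction followed by correction."""
    mu_bar, Sigma_bar = kf_predict(mu, Sigma, A, B, R, u)
    return kf_correct(mu_bar, Sigma_bar, C, Q, z)

# Track a state from an initial belief through one step.
mu, Sigma = np.zeros(2), np.eye(2)
u, z = np.array([0.1]), np.array([0.05])
mu, Sigma = kalman_filter(mu, Sigma, u, z, A, B, C, R, Q)
```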

How to derive these updates

- Simply work through the integrals
- Key "trick": completion of squares
- If your derivation results in a different format → apply the matrix inversion lemma to prove equivalence:
    (A + UCV)^{-1} = A^{-1} − A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}
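A numerical sanity check of the lemma (our own sketch with arbitrary well-conditioned matrices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 4, 2
A = np.eye(n) + rng.random((n, n))        # keep A well-conditioned
U, V = rng.random((n, k)), rng.random((k, n))
C = np.eye(k)

lhs = np.linalg.inv(A + U @ C @ V)
Ainv = np.linalg.inv(A)
rhs = Ainv - Ainv @ U @ np.linalg.inv(np.linalg.inv(C) + V @ Ainv @ U) @ V @ Ainv
print(np.allclose(lhs, rhs))              # True
```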
