School of Computer Science

Probabilistic Graphical Models

Conditional Random Fields & Case study I: image segmentation

Eric Xing Lecture 11, February 21, 2014

Reading: See class website

Hidden Markov Models: revisit

[Figure: HMM graphical model with hidden states y1, y2, y3, …, yT and emissions x1, x2, x3, …, xT]

Transition probabilities between any two states

$p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}$, or $p(y_t \mid y_{t-1}^i = 1) \sim \text{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}),\ \forall i \in I$.

 Start probabilities

$p(y_1) \sim \text{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)$.

 Emission probabilities associated with each state

$p(x_t \mid y_t^i = 1) \sim \text{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}),\ \forall i \in I$,

or in general: $p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i),\ \forall i \in I$.

Inference (review)

Forward algorithm

$\alpha_t^k \stackrel{\text{def}}{=} P(x_1, \ldots, x_{t-1}, x_t, y_t^k = 1), \qquad \alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}$

Backward algorithm

$\beta_t^k \stackrel{\text{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1), \qquad \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i$

The matrix-vector form, with $B_t(i) \stackrel{\text{def}}{=} p(x_t \mid y_t^i = 1)$ and $A(i,j) \stackrel{\text{def}}{=} p(y_{t+1}^j = 1 \mid y_t^i = 1)$:

$\boldsymbol{\alpha}_t = (A^{\mathsf T} \boldsymbol{\alpha}_{t-1}) \cdot B_t, \qquad \boldsymbol{\beta}_t = A (\boldsymbol{\beta}_{t+1} \cdot B_{t+1})$

Posterior marginals

$\gamma_t^i \stackrel{\text{def}}{=} p(y_t^i = 1 \mid x_{1:T}) \propto \alpha_t^i \beta_t^i$

$\xi_t^{i,j} \stackrel{\text{def}}{=} p(y_t^i = 1, y_{t+1}^j = 1, x_{1:T}) = \alpha_t^i\, a_{i,j}\, p(x_{t+1} \mid y_{t+1}^j = 1)\, \beta_{t+1}^j$
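Below is a minimal NumPy sketch of these recursions, assuming a discrete-emission HMM with start vector pi, transition matrix A (A[i, j] = a_{i,j}) and emission matrix B (B[i, k] = b_{i,k}). It is unscaled, so a real implementation would rescale or work in log space for long sequences.

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """x: observation indices (length T); pi: (M,); A: (M, M); B: (M, K)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))   # alpha[t, k] = P(x_1..x_t, y_t^k = 1)
    beta = np.zeros((T, M))    # beta[t, k]  = P(x_{t+1}..x_T | y_t^k = 1)

    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):                      # forward recursion
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)

    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):             # backward recursion
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])

    px = alpha[T - 1].sum()                    # likelihood P(x_{1:T})
    gamma = alpha * beta / px                  # gamma[t, i] = p(y_t^i = 1 | x_{1:T})
    # xi[t, i, j] = alpha_t^i * a_{i,j} * p(x_{t+1} | y_{t+1}^j = 1) * beta_{t+1}^j / P(x)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, x[1:]].T * beta[1:])[:, None, :]) / px
    return alpha, beta, gamma, xi
```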

Learning HMM

Supervised learning: estimation when the “right answer” is known

 Examples:

• GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
• GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls

Unsupervised learning: estimation when the “right answer” is unknown

Examples:
• GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition
• GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice

QUESTION: Update the parameters θ of the model to maximize P(x|θ) --- maximum likelihood (ML) estimation

Learning HMM: two scenarios

 Supervised learning: if only we knew the true state path then ML parameter estimation would be trivial

E.g., recall that for a completely observed tabular BN:

$\theta_{ijk}^{ML} = \dfrac{n_{ijk}}{\sum_{k'} n_{ijk'}}$

For the HMM, the ML estimates are simply normalized counts:

$a_{ij}^{ML} = \dfrac{\#(i \to j)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i}, \qquad b_{ik}^{ML} = \dfrac{\#(i \to k)}{\#(i \to \cdot)} = \dfrac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i}$

What if y is continuous? We can treat $\{(x_{n,t}, y_{n,t}) : t = 1{:}T,\ n = 1{:}N\}$ as $N \times T$ observations of, e.g., a GLIM, and apply the learning rules for GLIMs …
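A hedged counting sketch of the fully observed discrete case above: with the state paths known, a_ij and b_ik are just normalized counts. States and symbols are assumed to be integer-coded; the optional pseudocount is not on the slide and is included only as a common smoothing device.

```python
import numpy as np

def supervised_mle(seqs, M, K, pseudocount=0.0):
    """seqs: list of (x, y) pairs of equal-length integer sequences."""
    A = np.full((M, M), pseudocount)   # transition counts  #(i -> j)
    B = np.full((M, K), pseudocount)   # emission counts    #(i -> k)
    pi = np.full(M, pseudocount)       # start-state counts
    for x, y in seqs:
        pi[y[0]] += 1
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    # normalize each row: a_ij = #(i->j)/#(i->.), b_ik = #(i->k)/#(i->.)
    return (pi / pi.sum(),
            A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True))
```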

 Unsupervised learning: when the true state path is unknown, we can fill in the missing values using inference recursions.

The Baum-Welch algorithm (i.e., EM)

 Guaranteed to increase the log likelihood of the model after each iteration

 Converges to local optimum, depending on initial conditions

The Baum-Welch algorithm

The complete log likelihood

$\ell_c(\theta; x, y) = \log p(x, y) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)$

The expected complete log likelihood

$\langle \ell_c(\theta; x, y) \rangle = \sum_n \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i \;+\; \sum_n \sum_{t=2}^{T} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j} \;+\; \sum_n \sum_{t=1}^{T} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k}$

EM

The E step

$\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n)$

$\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1,\, y_{n,t}^j = 1 \mid x_n)$

The M step ("symbolically" identical to MLE)

$\pi_i^{ML} = \dfrac{\sum_n \gamma_{n,1}^i}{N}, \qquad a_{ij}^{ML} = \dfrac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}, \qquad b_{ik}^{ML} = \dfrac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}$
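A compact sketch of the resulting EM loop, reusing the forward_backward routine sketched earlier and assuming a single training sequence for brevity (sum the expected counts over n for several sequences). This is an illustration, not a robust implementation (no log-space scaling, no convergence test beyond a fixed iteration cap).

```python
import numpy as np

def baum_welch(x, pi, A, B, n_iter=50):
    for _ in range(n_iter):
        # E step: posteriors gamma and xi under the current parameters
        _, _, gamma, xi = forward_backward(x, pi, A, B)
        # M step: "symbolically" the supervised MLE, with expected counts
        # in place of observed counts
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for t, xt in enumerate(x):
            B[:, xt] += gamma[t]
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```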

Shortcomings of Hidden Markov Model (1): locality of features

[Figure: HMM graphical model, states Y1 … Yn, each emitting only its own observation X1 … Xn]

 HMM models capture dependences between each state and only its corresponding observation

NLP example: In a sentence segmentation task, each segmental state may depend not just on a single word (and the adjacent segmental states), but also on the (non-local) features of the whole line such as line length, indentation, amount of white space, etc.

 Mismatch between learning objective function and prediction objective function

 HMM learns a joint distribution of states and observations P(Y, X), but in a prediction task, we need the conditional probability P(Y|X)

Solution: Maximum Entropy Markov Model (MEMM)

[Figure: MEMM graphical model, states Y1 … Yn, each conditioned on the full observation sequence X1:n]

 Models dependence between each state and the full observation sequence explicitly

 More expressive than HMMs

 Discriminative model

 Completely ignores modeling P(X): saves modeling effort

 Learning objective function consistent with predictive function: P(Y|X)

Then, shortcomings of MEMM (and HMM) (2): the Label bias problem

[Figure: MEMM state-transition lattice over Observations 1-4, with States 1-5 and locally normalized transition probabilities at each observation; e.g., State 1 → State 1: 0.4, 0.45, 0.5; State 1 → State 2: 0.6, 0.55, 0.5; State 2 → State 2: 0.2, 0.3, 0.3; the remaining mass leaving State 2 is split among its five successors]

What the local transition probabilities say:
• State 1 almost always prefers to go to state 2
• State 2 almost always prefers to stay in state 2

MEMM: the Label bias problem


Probability of path 1→1→1→1:
• 0.4 × 0.45 × 0.5 = 0.09

MEMM: the Label bias problem


Probability of path 2→2→2→2:
• 0.2 × 0.3 × 0.3 = 0.018
Other paths: 1→1→1→1: 0.09

MEMM: the Label bias problem


Probability of path 1→2→1→2:
• 0.6 × 0.2 × 0.5 = 0.06
Other paths: 1→1→1→1: 0.09, 2→2→2→2: 0.018

MEMM: the Label bias problem


Probability of path 1→1→2→2:
• 0.4 × 0.55 × 0.3 = 0.066
Other paths: 1→1→1→1: 0.09, 2→2→2→2: 0.018, 1→2→1→2: 0.06

MEMM: the Label bias problem


Most likely path: 1→1→1→1
• Although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.
• Why?

MEMM: the Label bias problem


Most likely path: 1→1→1→1
• State 1 has only two outgoing transitions, but state 2 has five
• The average transition probability from state 2 is therefore lower

MEMM: the Label bias problem


Label bias problem in MEMM:
• Preference for states with a lower number of outgoing transitions over others
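The arithmetic from the preceding slides, collected in one place: running this small check confirms that 1→1→1→1 wins even though state 1 locally prefers state 2, because every step out of state 2 splits its probability mass over five successors.

```python
# Path probabilities as quoted on the slides (local transition probabilities
# multiplied along each path).
paths = {
    "1->1->1->1": 0.4 * 0.45 * 0.5,    # = 0.090
    "1->1->2->2": 0.4 * 0.55 * 0.3,    # = 0.066
    "1->2->1->2": 0.6 * 0.2 * 0.5,     # = 0.060
    "2->2->2->2": 0.2 * 0.3 * 0.3,     # = 0.018
}
for path, prob in sorted(paths.items(), key=lambda kv: -kv[1]):
    print(f"{path}: {prob:.3f}")       # the 1->1->1->1 path comes out on top
```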

Solution: Do not normalize probabilities locally

[The same lattice of locally normalized transition probabilities as on the previous slides]

From local probabilities ….

Solution: Do not normalize probabilities locally

[Figure: the same lattice, now with unnormalized local potentials (e.g., 20, 30, 5, …) in place of the locally normalized probabilities]

From local probabilities to local potentials

• States with fewer transitions do not have an unfair advantage!

From MEMM ….

[Figure: MEMM, a directed chain Y1 … Yn, each state conditioned on the full observation X1:n]

From MEMM to CRF

[Figure: CRF, an undirected chain Y1 … Yn, globally conditioned on x1:n]

 CRF is a partially directed model

 Discriminative model like MEMM

 Usage of global normalizer Z(x) overcomes the label bias problem of MEMM

 Models the dependence between each state and the entire observation sequence (like MEMM)

Conditional Random Fields

General parametric form:

$P(y \mid x) = \dfrac{1}{Z(x, \lambda, \mu)} \exp\Big( \sum_t \sum_k \lambda_k f_k(y_t, y_{t-1}, x) \;+\; \sum_t \sum_l \mu_l g_l(y_t, x) \Big)$

where $Z(x, \lambda, \mu)$ sums the exponentiated score over all label sequences, $f_k$ are transition features, and $g_l$ are node features.

[Figure: linear-chain CRF, Y1 … Yn conditioned on x1:n]

CRFs: Inference

Given CRF parameters λ and μ, find the y* that maximizes P(y|x)

 Can ignore Z(x) because it is not a function of y

 Run the max-product algorithm on the junction-tree of CRF:

[Figure: junction tree of the chain CRF, cliques (Y1,Y2), (Y2,Y3), …, (Yn-1,Yn) with separators Y2, …, Yn-1]

Same as the Viterbi decoding used in HMMs!

CRF learning

Given training data $\{(x_d, y_d)\}_{d=1}^{N}$, find $\lambda^*, \mu^*$ such that

$(\lambda^*, \mu^*) = \arg\max_{\lambda, \mu} L(\lambda, \mu) = \arg\max_{\lambda, \mu} \sum_{d=1}^{N} \log P(y_d \mid x_d; \lambda, \mu)$

Computing the gradient w.r.t. $\lambda$: the gradient of the log-partition function of an exponential family is the expectation of its sufficient statistics, so

$\nabla_{\lambda_k} L(\lambda, \mu) = \sum_d \Big( \sum_t f_k(y_{d,t}, y_{d,t-1}, x_d) \;-\; \sum_t \mathbb{E}_{P(y_t, y_{t-1} \mid x_d)}\big[ f_k(y_t, y_{t-1}, x_d) \big] \Big)$

(and analogously for $\mu$).

CRF learning

 Computing the model expectations:

Requires an exponentially large number of summations: is it intractable?

• No: it reduces to the expectation of each feature over the corresponding marginal probability of its neighboring nodes!
• Tractable! The marginals can be computed using the sum-product algorithm on the chain.

CRF learning

 Computing marginals using junction-tree calibration:

[Figure: linear-chain CRF, Y1 … Yn conditioned on x1:n]

 Junction Tree Initialization:

[Figure: junction tree of the chain, cliques (Y1,Y2), (Y2,Y3), …, (Yn-1,Yn) with separators Y2, …, Yn-1]

After calibration, the clique beliefs are proportional to the corresponding marginals; this procedure is also called the forward-backward algorithm.

CRF learning

 Computing feature expectations using calibrated potentials:

Now we know how to compute $\nabla L(\lambda, \mu)$.

Learning can now be done using gradient ascent:

$\lambda^{(t+1)} = \lambda^{(t)} + \eta\, \nabla_\lambda L(\lambda^{(t)}, \mu^{(t)}), \qquad \mu^{(t+1)} = \mu^{(t)} + \eta\, \nabla_\mu L(\lambda^{(t)}, \mu^{(t)})$

CRF learning

 In practice, we use a Gaussian Regularizer for the parameter vector to improve generalizability

 In practice, gradient ascent has very slow convergence

 Alternatives:

 Conjugate Gradient method

Limited-Memory Quasi-Newton methods (e.g., L-BFGS; a sketch follows below)
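A hedged sketch of how the regularized objective might be handed to an off-the-shelf L-BFGS routine. `loglik_and_grad` is a hypothetical callable returning the data log-likelihood and its gradient for a flattened parameter vector, and `sigma2` is the variance of the Gaussian prior; neither name comes from the lecture.

```python
import numpy as np
from scipy.optimize import minimize

def train(theta0, loglik_and_grad, sigma2=10.0):
    def neg_objective(theta):
        ll, grad = loglik_and_grad(theta)
        ll -= theta @ theta / (2.0 * sigma2)   # Gaussian regularizer on the parameters
        grad = grad - theta / sigma2
        return -ll, -grad                      # scipy minimizes, so negate both

    res = minimize(neg_objective, theta0, jac=True, method="L-BFGS-B")
    return res.x
```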

CRFs: some empirical results

Comparison of error rates on synthetic data:

[Figure: pairwise scatter plots of per-dataset error rates (MEMM error vs. CRF error, MEMM error vs. HMM error, CRF error vs. HMM error); the data is increasingly higher order in the direction of the arrow]

CRFs achieve the lowest error rate for higher-order data.

CRFs: some empirical results

 Parts of Speech tagging

Using the same set of features: HMM ≈ CRF > MEMM
Using additional overlapping features: CRF+ > MEMM+ >> HMM

Other CRFs

So far we have discussed only 1-dimensional chain CRFs

 Inference and learning: exact

 We could also have CRFs for arbitrary graph structure

 E.g: Grid CRFs

 Inference and learning no longer tractable

 Approximate techniques used

 MCMC

 Variational Inference

 Loopy Belief Propagation

 We will discuss these techniques soon

Image Segmentation

Image segmentation (FG/BG) by modeling interactions between random variables

 Images are noisy.

 Objects occupy continuous regions in an image. [Nowozin,Lampert 2012]

[Figure: input image, the pixel-wise separate optimal labeling, and the locally consistent joint optimal labeling]

$Y^* = \arg\max_{y \in \{0,1\}^n} \Big\{ \sum_{i \in S} V_i(y_i, X) \;+\; \sum_{i \in S} \sum_{j \in N_i} V_{i,j}(y_i, y_j) \Big\}$

where the first sum is the unary term and the second the pairwise term; $Y$: labels, $X$: data (features), $S$: pixels, $N_i$: neighbors of pixel $i$.
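A small sketch that only evaluates this objective for a candidate FG/BG labeling on a 4-connected grid, using a Potts-style pairwise reward with each edge counted once; maximizing it over all 2^n labelings is the hard part and is what the inference machinery (graph cuts or the approximate methods discussed later) is for. The argument names are illustrative assumptions.

```python
import numpy as np

def labeling_score(unary, labels, beta=1.0):
    """unary: (H, W, 2) scores V_i(y_i, X); labels: (H, W) array in {0, 1}."""
    H, W = labels.shape
    # unary term: pick the score of the chosen label at every pixel
    score = unary[np.arange(H)[:, None], np.arange(W)[None, :], labels].sum()
    # pairwise term: reward agreeing right and down neighbors (each edge once)
    score += beta * (labels[:, :-1] == labels[:, 1:]).sum()
    score += beta * (labels[:-1, :] == labels[1:, :]).sum()
    return score
```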

Undirected Graphical Models (with an Image Labeling Example)

 Image can be represented by 4-connected 2D grid.

 MRF / CRF with image labeling problem

X = {x_i}, i ∈ S: observed data of an image.

x_i: data at the i-th site (pixel or block) of the image; S is the set of sites.

Y = {y_i}, i ∈ S: (hidden) labels at each site, y_i ∈ {1, …, L}.

Objective: maximize the conditional probability, Y* = argmax_Y P(Y|X).

[Figure: 4-connected grid of sites s_i with data x_i and labels y_i = 0 (BG) / y_i = 1 (FG), and the resulting labeling Y*]

MRF (Markov Random Field)

Definition: Y = {y_i}, i ∈ S, is called a Markov Random Field on the set S with respect to the neighborhood system N iff, for all i ∈ S,

$P(y_i \mid y_{S \setminus \{i\}}) = P(y_i \mid y_{N_i}).$

The posterior is

$P(Y \mid X) = \frac{P(X, Y)}{P(X)} \;\propto\; \underbrace{P(X \mid Y)}_{(1)}\,\underbrace{P(Y)}_{(2)} \;=\; \prod_{i \in S} P(x_i \mid y_i)\; P(Y)$

(1) Very strict independence assumptions for tractability: the label of each site is a function of the data only at that site.

(2) P(Y) is modeled as an MRF:

$P(Y) = \frac{1}{Z} \prod_{c \in C} \psi_c(y_c)$

CRF

Definition: Let G = (S, E); then (X, Y) is said to be a conditional random field (CRF) if, when conditioned on X, the random variables y_i obey the Markov property with respect to the graph:

$P(y_i \mid X, y_{S \setminus \{i\}}) = P(y_i \mid X, y_{N_i})$  (compare the MRF: $P(y_i \mid y_{S \setminus \{i\}}) = P(y_i \mid y_{N_i})$)

 Globally conditioned on the observation X

[Figure: neighborhood structure of the CRF vs. the MRF over sites, with labels y_i, y_j and data x_i]

CRF vs MRF

 MRF: two-step

 Infer likelihood P(X|Y) and prior P(Y)

Use Bayes' theorem to determine the posterior P(Y|X):

$P(Y \mid X) = \frac{P(X, Y)}{P(X)} \;\propto\; P(X \mid Y)\,P(Y) \;=\; \prod_{i \in S} P(x_i \mid y_i)\; \frac{1}{Z} \prod_{c \in C} \psi_c(y_c)$

CRF: one-step discriminative model

 Directly Infer posterior P(Y|X)

 Popular Formulation Assumption

MRF (Potts model for P(Y), with only a pairwise potential):

$P(Y \mid X) = \frac{1}{Z} \exp\Big( \sum_{i \in S} \log p(x_i \mid y_i) \;+\; \sum_{i \in S} \sum_{i' \in N_i} V_2(y_i, y_{i'}) \Big)$

CRF (only up to pairwise clique potentials):

$P(Y \mid X) = \frac{1}{Z} \exp\Big( \sum_{i \in S} V_1(y_i \mid X) \;+\; \sum_{i \in S} \sum_{i' \in N_i} V_2(y_i, y_{i'} \mid X) \Big)$

Example of CRF – DRF

 A special type of CRF

The unary and pairwise potentials are designed using local discriminative classifiers. Posterior:

$P(Y \mid X) = \frac{1}{Z} \exp\Big( \sum_{i \in S} A_i(y_i, X) \;+\; \sum_{i \in S} \sum_{j \in N_i} I_{ij}(y_i, y_j, X) \Big)$

Association potential

 Local discriminative model for site i: using logistic link with GLM.

$A_i(y_i, X) = \log P(y_i \mid f_i(X)), \qquad P(y_i = 1 \mid f_i(X)) = \sigma\big(w^{\mathsf T} f_i(X)\big) = \frac{1}{1 + \exp\!\big(-w^{\mathsf T} f_i(X)\big)}$

Interaction potential

A measure of how likely sites i and j are to have the same label given X

$I_{ij}(y_i, y_j, X) = k\, y_i y_j + (1 - k)\big( 2\,\sigma\big(y_i y_j\, \mu_{ij}(X)\big) - 1 \big)$

(1) data-independent smoothing term; (2) data-dependent pairwise logistic function

S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 2006.
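A hedged sketch of the two DRF potentials as printed above, with labels coded as y ∈ {-1, +1}. The pair feature map mu_ij and its weight vector v follow Kumar and Hebert's formulation and are assumptions of this sketch rather than something spelled out on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def association(y_i, f_i, w):
    # A_i(y_i, X) = log P(y_i | f_i(X)), logistic link; y_i in {-1, +1}
    return np.log(sigmoid(y_i * (w @ f_i)))

def interaction(y_i, y_j, mu_ij, v, k=0.5):
    # I_ij = k * y_i y_j + (1 - k) * (2 * sigma(y_i y_j v^T mu_ij(X)) - 1)
    # first term: data-independent smoothing; second: data-dependent pairwise logistic
    return k * y_i * y_j + (1.0 - k) * (2.0 * sigmoid(y_i * y_j * (v @ mu_ij)) - 1.0)
```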

Example of CRF – DRF Results

Task: Detecting man-made structure in natural scenes.

Each image is divided into non-overlapping 16×16 tile blocks.

 An example

[Figure: input image and the labelings produced by the logistic classifier, the MRF, and the DRF]

Logistic: no smoothness in the labels

MRF: smoothed false positives, reflecting the lack of neighborhood interaction of the data

S. Kumar and M. Hebert. Discriminative Random Fields. IJCV, 2006.

Example of CRF – Body Pose Estimation

 Task: Estimate a body pose.

 Need to detect parts of human body

 Appearance + Geometric configuration.

 A large number of DOFs

 Use CRF to model a human body [Zisserman 2010]

 Nodes: Parts (head, torso, upper/ lower left/right arms).

L = (l_1, …, l_6), with l_i = [x_i, y_i, θ_i]
Edges: pairwise linkage between parts

 Tree vs. Graph

V. Ferrari et al. Progressive Search Space Reduction for Human Pose Estimation. CVPR 2008.
D. Ramanan. Learning to Parse Images of Articulated Bodies. NIPS 2006.

Example of CRF – Body Pose Estimation

 Posterior of configuration

$P(L \mid I) \propto \exp\Big( \sum_i \phi(l_i) \;+\; \sum_{(i,j) \in E} \psi(l_i, l_j) \Big)$

• ψ(l_i, l_j): relative position with geometric constraints
• ϕ(l_i): local image evidence for a part in a particular location
• If E is a tree, exact inference is performed efficiently by BP.

 Example of unary and pairwise terms

• Unary term: appearance feature
• Pairwise term: kinematic layout

[Figure: the unary term compares the HOG of the image at l_i with a learned HOG template of the lower arm (L2 distance); the pairwise term on (l_i, l_j) is a truncated quadratic] [Zisserman 2010]

Example of CRF – Results of Body Pose Estimation

Examples of results

[Ramanan 2006]

[Ferrari et al. 2008]

 Datasets and codes are available.

 http://www.ics.uci.edu/~dramanan/papers/parse/

 http://www.robots.ox.ac.uk/~vgg/research/pose_estimation/

Summary

• Conditional Random Fields are partially directed discriminative models
• They overcome the label bias problem of MEMMs by using a global normalizer
• Inference for 1-D chain CRFs is exact
  • Same as max-product or Viterbi decoding
• Learning is also exact; globally optimal parameters can be learned
  • Requires the sum-product or forward-backward algorithm
• CRFs involving arbitrary graph structure are intractable in general
  • E.g.: grid CRFs
  • Inference and learning require approximation techniques
    • MCMC sampling
    • Variational methods
    • Loopy BP
