
MACHINE LEARNING

Vasant Honavar
Artificial Intelligence Research Laboratory
Department of Computer Science
Center for Computational Intelligence, Learning, & Discovery
[email protected]
www.cs.iastate.edu/~honavar/
www.cild.iastate.edu/


Learning Bayesian networks

[Figure: Data + prior information are given to a learner L, which outputs a Bayesian network over E, B, R, A, C together with its conditional probability tables, e.g. P(A | E, B):]

E    B    P(A | E, B)
e    b    0.9   0.1
e    ¬b   0.2   0.8
¬e   b    0.9   0.1
¬e   ¬b   0.01  0.99

The Learning Problem

                  Known Structure                      Unknown Structure
Complete Data     Statistical parametric estimation    Discrete optimization over
                  (closed-form eq.)                    structures (discrete search)
Incomplete Data   Parametric optimization              Combined
                  (EM, gradient descent...)            (Structural EM, mixture models…)


[The next few slides repeat the table above, highlighting each case in turn with an example: data over E, B, A and the network E → A ← B, where the learner L fills in the unknown CPT entries P(A | E, B).]

Learning Bayesian Networks

• Parameter learning: Complete data (Review)
  – Statistical parametric fitting
  – Maximum likelihood estimation
  – Bayesian inference
• Parameter learning: Incomplete data
• Structure learning: Complete data
• Application: classification
• Structure learning: Incomplete data

Estimating probabilities from data (discrete case)

• Maximum likelihood estimation
• Bayesian estimation
• Maximum a posteriori estimation

Bayesian estimation

• Treat the unknown parameters as random variables
• Assume a prior distribution for the unknown parameters
• Update the distribution of the parameters based on data
• Use Bayes' rule to make predictions


Bayesian Networks and Bayesian Prediction

[Figure (plate notation): parameters θX and θY|X are shared across the observed data instances X[1], …, X[M], Y[1], …, Y[M] and the query instance X[M+1], Y[M+1].]

• Priors for each parameter group are independent
• Data instances are independent given the unknown parameters

Bayesian Networks and Bayesian Prediction

• We can “read” from the network:
  – Complete data ⇒ posteriors on parameters are independent

Bayesian Prediction (cont.)

• Since the posteriors on the parameters for each node are independent, we can compute them separately
• Posteriors for parameters within a node are also independent:

[Figure: the refined model splits θY|X into independent parameters θY|X=0 and θY|X=1.]

• Complete data ⇒ the posteriors on θY|X=0 and θY|X=1 are independent

Bayesian Prediction

• Given these observations, we can compute the posterior for each multinomial θ_{X_i | pa_i} independently
• The posterior is Dirichlet with parameters
  α(X_i = 1 | pa_i) + N(X_i = 1 | pa_i), …, α(X_i = k | pa_i) + N(X_i = k | pa_i)
• The predictive distribution is then represented by the parameters

\tilde{θ}_{x_i | pa_i} = \frac{α(x_i, pa_i) + N(x_i, pa_i)}{α(pa_i) + N(pa_i)}

Assigning Priors for Bayesian Networks

• We need α(x_i, pa_i) for each node X_i
• We can use initial parameters Θ0 as prior information
• We also need an equivalent sample size parameter M0
• Then we let α(x_i, pa_i) = M0 · P(x_i, pa_i | Θ0)
• This allows update of a network in response to new data

Learning Parameters

• Comparing two distributions, P(x) (the true model) vs. Q(x) (the learned distribution): measure their KL divergence

KL(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

  – KL(P || Q) ≥ 0
  – KL(P || Q) = 0 iff P and Q are equal

Learning Parameters: Summary

• Estimation relies on sufficient statistics
  – For multinomials these are of the form N(x_i, pa_i)
• Parameter estimation:

MLE:                  \hat{θ}_{x_i | pa_i} = \frac{N(x_i, pa_i)}{N(pa_i)}
Bayesian (Dirichlet): \tilde{θ}_{x_i | pa_i} = \frac{α(x_i, pa_i) + N(x_i, pa_i)}{α(pa_i) + N(pa_i)}

• Bayesian methods also require a choice of priors
• Both MLE and Bayesian estimates are asymptotically equivalent and consistent, but the latter work better with small samples
• Both can be implemented in an on-line manner by accumulating sufficient statistics

[The learning-problem table is shown again here as a transition: we now turn to the unknown-structure, complete-data case.]


Why do we need accurate structure?

[Figure: the true network (Earthquake, Burglary → Alarm Set → Sound) compared with a version missing an arc and a version with an extraneous arc.]

Missing an arc
• Cannot be compensated for by fitting parameters
• Incorrect independence assumptions

Extraneous arc
• Increases the number of parameters to be estimated
• Incorrect independence assumptions

Approaches to BN Structure Learning

• Score-based methods
  – Assign a score to each candidate BN structure using a suitable scoring function
  – Search the space of candidate network structures for a BN structure with the maximum score
• Independence-testing based methods
  – Use independence tests to determine the structure of the network

Score-based BN Structure Learning

Define a scoring function that evaluates how well a structure matches the data

[Figure: data over E, B, A and several candidate structures over E, B, A.]

Search for a structure that maximizes the score

Need for parsimony


Basic idea: Minimum description length (MDL) principle

h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\, P(h)

h_{MAP} = \arg\min_{h \in H} \left[ -\log P(D \mid h) - \log P(h) \right]

h_{MDL} = \arg\min_{h \in H} \left[ C_{D|h}(D \mid h) + C_h(h) \right]

We need to design a scoring function that minimizes the description length of the hypothesis and the description length of the data given the hypothesis. In this case, the hypothesis is a Bayesian network which represents a joint probability distribution

Scoring function

A BN scoring function consists of
• a term that corresponds to the number of bits needed to encode the BN structure and parameters
• a term that corresponds to the number of bits needed to encode the data given the BN

We proceed to specify each of these terms

Encoding a Bayesian Network

It suffices to
• list the parents of each node
• record the conditional probabilities associated with each node

Consider a BN with n variables.

• Consider a node i with k_i parents.
• We need k_i log₂ n bits to list its parents.
• Suppose node i (variable X_i) takes s_i distinct values.
• Suppose the j-th parent takes s_j distinct values.
• Suppose we use d bits to store each conditional probability.

Encoding a Bayesian Network

• Under the encoding scheme described, the description length of a particular Bayesian network is given by

\sum_{i=1}^{n} \left( k_i \log_2 n + d\,(s_i - 1) \prod_{X_j \in Parents(X_i)} s_j \right)

Encoding the Data

• Suppose we have M independent observations (instantiations) of the random variables X_1, …, X_n
• Let V_i be the domain of random variable X_i
• Each observation corresponds to an atomic event e_k ∈ V_1 × V_2 × … × V_n
• Let p_k be the probability of e_k
• When M is large, we expect M p_k occurrences of e_k among the M observations. Under optimal encoding, the number of bits needed to encode the data is

-M \sum_{e_k \in V_1 \times \cdots \times V_n} p_k \log_2 p_k

Encoding the Data … using a Bayesian network

• But … we do not know p_k, the probability of e_k!
• What we have instead is a Bayesian network, which assigns a probability q_k to e_k
• When we use the learned network to encode the data, the number of bits needed (using a code based on the network's probabilities) is

-M \sum_{e_k \in V_1 \times \cdots \times V_n} p_k \log_2 q_k

Encoding the Data … using a Bayesian network

Theorem (Gibbs):

-M \sum_{e_k \in V_1 \times \cdots \times V_n} p_k \log_2 p_k \;\le\; -M \sum_{e_k \in V_1 \times \cdots \times V_n} p_k \log_2 q_k

with equality holding if and only if ∀k, p_k = q_k

• Number of bits needed to encode the data if true probabilities of each atomic event are known is less than or equal to the number of bits needed to encode the data using a code based on the estimated probabilities.

Putting the two together

• The MDL principle recommends minimizing the sum of the encoding length of the model (Bayes network) and the encoding length of the data using the model:

\sum_{i=1}^{n} \left( k_i \log_2 n + d\,(s_i - 1) \prod_{X_j \in Parents(X_i)} s_j \right) \;-\; M \sum_{e_k \in V_1 \times \cdots \times V_n} p_k \log_2 q_k

• Problems with evaluating the second term:
  – we do not know the probabilities p_k
  – the second term requires summation over all atomic events (all instantiations of the n random variables)

Kullback-Leibler Divergence to the rescue!

• Let P and Q be two probability distributions over the same event space, such that an event e_k is assigned probability p_k by P and q_k by Q

KL(P \| Q) = \sum_k p_k (\log p_k - \log q_k)

KL(P || Q) ≥ 0
KL(P || Q) = 0 iff P = Q

Kullback-Leibler Divergence to the rescue!

• Theorem: The encoding length of the data is a monotonically increasing function of the KL divergence between the distribution Q defined by the model and the true distribution P.
• Hence, we can use the estimated KL divergence as a proxy for the encoding length of the data (using the model) to score a model.
• We can use local computations over a Bayes network to evaluate

KL(P \| Q) = \sum_k p_k (\log p_k - \log q_k)

Applying the MDL Principle

• Exhaustive search over the space of all networks is infeasible!
• Evaluating the KL divergence directly is infeasible!
• Hence we need to
  – resort to heuristic search to find a network with a near-minimal description length
  – develop a more efficient method of evaluating the KL divergence of a candidate network

Evaluating KL divergence for a network

• Theorem (Chow and Liu, 1968). Suppose we define the mutual information between any two nodes X_i and X_j as

W(X_i, X_j) = \sum_{X_i, X_j} P(X_i, X_j) \log_2 \frac{P(X_i, X_j)}{P(X_i)\, P(X_j)}

• Then the cross entropy KL(P || Q) over all tree-structured distributions is minimized when the graph representing Q(X_1, …, X_n) is a maximum-weight spanning tree of the graph in which the edge between nodes X_i and X_j is assigned the weight W(X_i, X_j).
• The resulting tree-structured model can be shown to correspond to the maximum likelihood model among all tree-structured models.

Evaluating KL divergence for a network

• Theorem (Lam and Bacchus, 1994). Suppose we define a weight measure between a node X_i and an arbitrary parent set Parents(X_i) as

W(X_i, Parents(X_i)) = \sum_{X_i, Parents(X_i)} P(X_i, Parents(X_i)) \log_2 \frac{P(X_i, Parents(X_i))}{P(X_i)\, P(Parents(X_i))}

• Then the cross entropy KL(P || Q) for a Bayesian network representing Q(X_1, …, X_n) is a monotonically decreasing function of

\sum_{i=1,\; Parents(X_i) \neq \emptyset}^{n} W(X_i, Parents(X_i))

• Hence, KL(P || Q) is minimized if and only if this sum of weights is maximized.

In words…

• If we find a Bayes network that maximizes

\sum_{i=1,\; Parents(X_i) \neq \emptyset}^{n} W(X_i, Parents(X_i))

• then the probability distribution Q modeled by the network will be closest, with respect to KL(P || Q), to the underlying distribution P from which the data have been sampled.
• Theorem: It is always possible to decrease KL(P || Q) by adding arcs to the network.
• Hence the need for MDL!

In summary

• We need to find a Bayes network that maximizes

\sum_{i=1,\; Parents(X_i) \neq \emptyset}^{n} W(X_i, Parents(X_i))

• while minimizing

\sum_{i=1}^{n} \left( k_i \log_2 n + d\,(s_i - 1) \prod_{X_j \in Parents(X_i)} s_j \right)

Alternative Scoring Functions - Notation

Each X_i takes r_i distinct values

s_i = \prod_{X_j \in Parents(X_i)} r_j

θ_{ijk} = probability that X_i takes the j-th value in its domain given the k-th instantiation of its parent set Parents(X_i)

η_{ijk} are the pseudocounts (from the Dirichlet prior)

N_{ijk} are the observed counts for the corresponding instantiation

N_{ik} = \sum_j N_{ijk}; \qquad η_{ik} = \sum_j η_{ijk}

Bayesian scoring function

• Let B = (G, P(θ)) be a Bayesian network with graph structure G and probability distribution P(θ) over a set of n random variables X
• Prior probability distribution over networks: p(B) = p(G, θ)
• The posterior probability given data D is

p(G, θ \mid D) = \frac{p(G, θ, D)}{p(D)} = \frac{p(D, G, θ)}{\sum_{G, θ} p(D, G, θ)} \;\propto\; p(G)\, p(θ \mid G)\, p(D \mid G, θ)

Bayesian scoring function

p(D \mid G, θ) \propto \prod_{i=1}^{n} \prod_{j=1}^{r_i} \prod_{k=1}^{s_i} θ_{ijk}^{N_{ijk}}

where
• n is the number of random variables
• r_i is the number of distinct values of node i
• s_i is the number of instantiations of the parents of node i
• N_{ijk} are the corresponding counts estimated from D
• η_{ijk} are the corresponding pseudocounts

p(θ \mid G) \propto \prod_{i=1}^{n} \prod_{j=1}^{r_i} \prod_{k=1}^{s_i} θ_{ijk}^{η_{ijk} - 1}

p(θ \mid G, D) \propto \prod_{i=1}^{n} \prod_{j=1}^{r_i} \prod_{k=1}^{s_i} θ_{ijk}^{N_{ijk} + η_{ijk} - 1}

Geiger-Heckerman Scoring Function

• Geiger-Heckerman Measure for a BN with graph G and parameters Θ

Q_{GH}(G, D) = \log p(G) + \log \int p(D \mid G, Θ)\, p(Θ \mid G)\, dΘ

= \log p(G) + \sum_{i=1}^{n} \sum_{k=1}^{s_i} \left[ \log \frac{Γ(η_{ik})}{Γ(η_{ik} + N_{ik})} + \sum_{j=1}^{r_i} \log \frac{Γ(η_{ijk} + N_{ijk})}{Γ(η_{ijk})} \right]

• Drawback – Does not explicitly penalize complex networks

Cooper-Herskovits Scoring Function

• Cooper-Herskovits Measure for a BN with graph G and parameters Θ

Q_{CH}(G, D) = \log p(G) + \sum_{i=1}^{n} \sum_{k=1}^{s_i} \left[ \log \frac{Γ(r_i)}{Γ(r_i + N_{ik})} + \sum_{j=1}^{r_i} \log Γ(1 + N_{ijk}) \right]

• Drawback – Does not explicitly penalize complex networks

Standard Bayesian Measure

• Standard Bayesian Measure for a BN with graph G and parameters Θ

Q_{Bayes}(G, D) = \log p(G) + \sum_{i=1}^{n} \sum_{j=1}^{r_i} \sum_{k=1}^{s_i} (N_{ijk} + η_{ijk} - 1) \log \frac{N_{ijk} + η_{ijk} - 1}{N_{ik} + η_{ik} - (r_i - 1)} - \frac{1}{2} Dim(G) \log N

where Dim(G) is the number of parameters in the BN and N is the sample size;
(1/2) log N is the average number of bits needed to store a number between 1 and N

Standard Bayesian Measure – Asymptotic version

• Asymptotic version of the standard Bayesian Measure for a BN with graph G and parameters Θ

Q_{AsymBayes}(G, D) = Q_{MDL}(G, D) = \log p(G) + \sum_{i=1}^{n} \sum_{j=1}^{r_i} \sum_{k=1}^{s_i} N_{ijk} \log \frac{N_{ijk}}{N_{ik}} - \frac{1}{2} Dim(G) \log N

Asymptotic Information Measures

Q_I(B, D) = \log p(G) + \sum_{i,j,k} N_{ijk} \log \frac{N_{ijk}}{N_{ik}} - \dim(B)\, f(D)

where f(D) is a non-negative penalty function:
• f(D) = 0 for the maximum likelihood information criterion
• f(D) = 1 for the Akaike information criterion
• f(D) = (1/2) log N for the Schwarz information criterion

Note: MDL is a special case of this measure.


Structure Search as Optimization

• Input:
  – Training data
  – Scoring function
  – Set of possible structures
• Output:
  – A network that maximizes the score
• Key computational property: decomposability
  score(G) = ∑ score(“family” of X in G)

Tree-Structured Networks

[Figure: the ALARM network (MINVOLSET, PULMEMBOLUS, INTUBATION, KINKEDTUBE, VENTMACH, DISCONNECT, …, HREKG, HRSAT, HRBP, BP).]

Trees:
• At most one parent per variable

Why trees?
• Elegant mathematics
  – we can exactly and efficiently solve the optimization problem
• Sparse parameterization
  – avoids overfitting

Learning Trees

• Let p(i) denote the parent of X_i
• We can write the Bayesian score as

Score(G : D) = \sum_i Score(X_i : Pa_i) = \sum_i \left( Score(X_i : X_{p(i)}) - Score(X_i) \right) + \sum_i Score(X_i)

The first sum is the improvement over the “empty” network; the second is the score of the “empty” network.
• Score = sum of edge scores + constant

Learning Trees

• Set w(j→i) = Score(X_j → X_i) − Score(X_i)
• Find the tree (or forest) with maximal weight
  – standard maximum spanning tree algorithm: O(n² log n)
• Theorem: This procedure finds the tree with the maximum score

Beyond Trees

• When we consider more complex networks, the problem is not as easy
• Suppose we allow at most two parents per node
• A greedy algorithm is no longer guaranteed to find the optimal network
• In fact, no efficient algorithm exists
• Theorem: Finding the maximal-scoring structure with at most k parents per node is NP-hard for k > 1

Heuristic Search

• Define a search space:
  – search states are possible structures
  – operators make small changes to structure
• Traverse the space looking for high-scoring structures
• Search techniques:
  – Greedy hill-climbing
  – Best-first search
  – Simulated annealing
  – ...

K2 Algorithm (Cooper and Herskovits)

• Start with an ordered list of random variables

• For each variable X_i, add to its parent set the node that is lower-numbered than X_i and yields the maximum improvement in score
• Repeat until the score does not improve or a complete network is obtained

• Disadvantage: Requires an ordered list of nodes

B Algorithm (Buntine)

• Start with the parent set of each random variable initialized to the empty set
• At each step, add a link (a node to the parent set of some node) that does not introduce a cycle and yields the maximum improvement in score
• Repeat until the score does not improve or a complete network is obtained

Local Search

• Start with a given network
  – empty network
  – best tree
  – a random network
• At each iteration
  – evaluate all possible changes
  – apply the change based on score
• Stop when no modification improves the score

Heuristic Search

• Typical operations on a network over S, C, E, D: add an arc (e.g. add C → D), delete an arc, reverse an arc (e.g. reverse C → E)
• To update the score after a local change, only re-score the families that changed, e.g.
  Δscore = S({C, E} → D) − S({E} → D)

Learning in Practice: Alarm network

[Plot: KL divergence from the true distribution vs. #samples (0 to 5000), comparing "structure known, fit parameters" against "learn both structure and parameters".]


Local Search: Possible Pitfalls

• Local search can get stuck in:
  – Local maxima: all one-edge changes reduce the score
  – Plateaus: some one-edge changes leave the score unchanged
• Standard heuristics can escape both
  – Random restarts
  – TABU search
  – Simulated annealing

Independence Based Methods

• Rely on independence tests to decide whether to add links between nodes in the structure search phase
• Need to penalize complex structures
  – hard to beat a fully connected network!
• In the most general setting, there are too many independence tests to consider
• Sometimes it is possible to infer additional independences from known (or inferred) independences (see Bromberg et al., 2006 and references cited therein)

Structure Search: Summary

• Discrete optimization problem
• In some cases the optimization problem is easy
  – example: learning trees
• In general, NP-hard
  – need to resort to heuristic search
  – or restrict connectivity: each node is assumed to have no more than l parents, where l is much smaller than n
  – or use stochastic search, e.g., simulated annealing, genetic algorithms

Structure Discovery

• Task: discover structural properties
  – Is there a direct connection between X and Y?
  – Does X separate two “subsystems”?
  – Does X causally affect Y?
• Scientific examples:
  – Disease properties and symptoms
  – Interactions between the expression of genes

Discovering Structure

[Figure: a posterior P(G | D) concentrated on a single network over E, B, R, A, C.]

• Model selection
  – Pick a single high-scoring model
  – Use that model to infer domain structure

Discovering Structure

[Figure: a flatter posterior P(G | D) spread over several candidate networks over E, B, R, A, C.]

• Problem
  – Small sample size ⇒ many high-scoring models
  – An answer based on one model is often useless
  – We want features common to many models

Bayesian Approach

• Posterior distribution over structures
• Estimate the probability of features
  – Edge X → Y
  – Path X → … → Y
  – …

P(f \mid D) = \sum_G f(G)\, P(G \mid D)

where f(G) is the indicator function for feature f (e.g., the edge X → Y) and P(G | D) is the Bayesian score for G.

MCMC over Networks

• Cannot enumerate structures, so sample structures
• MCMC sampling
  – Define a Markov chain over BNs
  – Run the chain to get samples from the posterior P(G | D)

P(f(G) \mid D) \approx \frac{1}{n} \sum_{i=1}^{n} f(G_i)

• Possible pitfalls:
  – Huge (super-exponential) number of networks
  – Time for the chain to converge to the posterior is unknown
  – Islands of high posterior connected by low bridges

Fixed Ordering

• Suppose that
  – we know the ordering of the variables, say X1 > X2 > X3 > X4 > … > Xn, so the parents of Xi must be in X1, …, Xi−1
  – we limit the number of parents per node to k
• Then there are at most 2^(k·n·log n) networks consistent with the ordering
• Intuition: the order decouples the choices of parents
  – the choice of Pa(X7) does not restrict the choice of Pa(X12)
• Upshot: we can compute efficiently in closed form
  – the likelihood P(D | ≺)
  – the feature probability P(f | D, ≺)
(≺ denotes the ordering)

Sample Orderings

• We can write

P(f \mid D) = \sum_{\prec} P(f \mid \prec, D)\, P(\prec \mid D)

• Sample orderings and approximate

P(f \mid D) \approx \frac{1}{n} \sum_{i=1}^{n} P(f \mid \prec_i, D)

• MCMC sampling
  – Define a Markov chain over orderings
  – Run the chain to get samples from the posterior P(≺ | D)

Application: Gene Expression Data Analysis

Friedman et al., 2001
• Input: measurements of gene expression under different conditions
  – thousands of genes
  – hundreds of experiments
• Output: models of gene interaction
  – uncover pathways

“Mating response” Substructure

[Figure: automatically constructed sub-network over yeast genes (SST2, KAR4, FUS1, TEC1, NDJ1, KSS1, PRM1, AGA1, YLR343W, AGA2, TOM6, FIG1, FUS3, STE6, YEL059W, YLR334C, MFA1).]

• Automatically constructed sub-network of high-confidence edges
• Almost exact reconstruction of the yeast mating pathway

Learning Problem

[Recap: the learning-problem table; we now turn to learning from incomplete data.]

Incomplete Data

• Data are often incomplete
  – some variables of interest are not assigned values
This phenomenon occurs when we have
• Missing values
  – some variables are unobserved in some instances
• Hidden variables
  – some variables are never observed
  – we might not even know they exist

Hidden (Latent) Variables

• Why should we care about hidden variables?

[Figure: two models over X1, X2, X3 and Y1, Y2, Y3. With a hidden variable H between the X's and the Y's the model has 17 parameters; without it, 59 parameters.]

Incomplete Data

• In the presence of incomplete data, the likelihood can have multiple maxima

• Example: a hidden variable H with an observed child Y
  – If H has two values, the likelihood has two maxima
  – In practice, many local maxima

Expectation Maximization (EM)

• A general-purpose method for learning from incomplete data
Intuition:
• If we had true counts, we could estimate parameters
• But with missing values, counts are unknown
• We “complete” the counts using probabilistic inference based on the current parameter assignment
• We use the completed counts as if they were real to re-estimate the parameters

Expectation Maximization (EM)

Data (X, Y, Z):        Expected counts N(X, Y):
  H  ?  T                X Y    #
  T  ?  ?                H H   1.3
  H  H  ?                T H   0.4
  H  T  T                H T   1.7
  T  T  H                T T   1.6

Current model: P(Y=H | X=H, Z=T, Θ) = 0.3 and P(Y=H | X=T, Θ) = 0.4

Expectation Maximization (EM)

[Figure: one EM iteration. From the initial network (G, Θ0) and the training data, the E-step computes expected counts N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H); the M-step reparameterizes to give the updated network (G, Θ1); iterate.]

Expectation Maximization (EM)

• Formal guarantees:
  – L(Θ1 : D) ≥ L(Θ0 : D): each iteration improves the likelihood
  – If Θ1 = Θ0, then Θ0 is a stationary point of L(Θ : D); usually this means a local maximum

Expectation Maximization (EM)

• Computational bottleneck: computation of the expected counts in the E-step
  – Need to compute the posterior for each unobserved variable in each instance of the training set
  – All posteriors for an instance can be derived from one pass of standard BN inference

Summary of Parameter Learning from Incomplete Data

• Incomplete data makes parameter estimation hard
• The likelihood function
  – does not have a closed form
  – is multimodal
• Finding maximum likelihood parameters:
  – EM
  – gradient ascent
• Both exploit inference procedures for Bayesian networks to compute expected sufficient statistics

Learning Problem

[Recap: the learning-problem table; next, structure learning from incomplete data.]

Incomplete Data: Structure Scores

• Recall the Bayesian score:

P(G \mid D) \propto P(G)\, P(D \mid G) = P(G) \int P(D \mid G, Θ)\, P(Θ \mid G)\, dΘ

• With incomplete data:
  – we cannot evaluate the marginal likelihood in closed form
  – we have to resort to approximations: evaluate the score around the MAP parameters, which requires finding the MAP parameters (e.g., via EM)

Structural EM

• Recall that with complete data we had: decomposition ⇒ efficient search
Idea:
• Instead of optimizing the real score…
• find a decomposable alternative score
• such that maximizing the new score ⇒ improvement in the real score

Structural EM

Idea:
• Use the current model to help evaluate new structures
Outline:
• Perform search in (structure, parameters) space
• At each iteration, use the current model for finding either
  – better-scoring parameters: “parametric” EM step, or
  – a better-scoring structure: “structural” EM step

[Figure: one Structural EM iteration. The current network and the training data are used to compute expected counts both for the current structure (N(X1), N(X2), N(X3), N(H, X1, X2, X3), N(Y1, H), N(Y2, H), N(Y3, H)) and for candidate alternative families (e.g. N(X2, X1), N(H, X1, X3), N(Y1, X2), N(Y2, Y1, H)); these are used to score and parameterize candidate structures; iterate.]

Some Additional Graphical Models

• (Finite) Mixture models
• Graphical models for sequence data
  – Markov models and hidden Markov models
• Undirected graphical models
  – Markov networks
  – Markov random fields

Finite Mixture Models

p(x) = \sum_{k=1}^{K} p(x, c_k) = \sum_{k=1}^{K} p(x \mid c_k)\, p(c_k) = \sum_{k=1}^{K} p(x \mid c_k, θ_k)\, α_k

where p(x | c_k, θ_k) is the k-th component model with parameters θ_k, and α_k is the k-th mixture weight.

Example: Mixture of Gaussians

• Gaussian mixtures:

p(x) = \sum_{k=1}^{K} p(x \mid c_k, θ_k)\, α_k

Each mixture component is a multidimensional Gaussian with its own mean μ_k and covariance “shape” Σ_k.

e.g., K = 2, 1-dim:  {θ, α} = {μ_1, σ_1, μ_2, σ_2, α_1}

[Plots: p(x) for the two component models (top) and for the resulting mixture model (bottom), for x in [-5, 10].]


Example: Mixture of Naïve Bayes

p(x) = \sum_{k=1}^{K} p(x \mid c_k, θ_k)\, α_k

p(x \mid c_k, θ_k) = \prod_{j=1}^{d} p(x_j \mid c_k, θ_k)

Conditional Independence model for each component (often quite useful as a first-order approximation)

Interpretation of Mixtures

• C has a direct (physical) interpretation
  – e.g., C = {age of fish}, C = {male, female}
• C might have an interpretation
  – e.g., clusters of Web surfers
• C is just a convenient latent variable
  – e.g., flexible density estimation

Graphical Models for Mixtures

E.g., Mixtures of Naïve Bayes:

[Figure: C (discrete, hidden) with observed children X1, X2, X3.]

Sequential Mixtures

[Figure: three time slices (t−1, t, t+1), each with a hidden state C and observed X1, X2, X3; the C's are linked across time.]

Markov Mixtures
• C has Markov dependence
• Hidden Markov model (here with naïve Bayes observations)
• C is a discrete state that couples the observables

Mixture density

P(x \mid θ) = \sum_{j=1}^{c} P(x \mid ω_j, θ_j)\, P(ω_j), \qquad where θ = (θ_1, θ_2, …, θ_c)^t

Here the P(x | ω_j, θ_j) are the component densities and the P(ω_j) are the mixing parameters.

• Task:
  – Use samples drawn from this mixture density to estimate the unknown parameter vector θ.
  – Once θ is known, we can decompose the mixture into its components.

Identifiability of mixture density

• A density P(x | θ) is said to be identifiable if θ ≠ θ′ implies that there exists an x such that P(x | θ) ≠ P(x | θ′)
• Example: consider the case where x is binary and P(x | θ) is the mixture

P(x \mid θ) = \frac{1}{2} θ_1^x (1 - θ_1)^{1-x} + \frac{1}{2} θ_2^x (1 - θ_2)^{1-x}
            = \begin{cases} \frac{1}{2}(θ_1 + θ_2) & \text{if } x = 1 \\ 1 - \frac{1}{2}(θ_1 + θ_2) & \text{if } x = 0 \end{cases}

• Assume that P(x = 1 | θ) = 0.6, so P(x = 0 | θ) = 0.4, which implies θ_1 + θ_2 = 1.2
• But … we cannot determine the mixture: any θ_1, θ_2 with θ_1 + θ_2 = 1.2 gives the same distribution


Identifying mixture distributions

• Unidentifiability of the mixture distribution suggests the impossibility of unsupervised learning
• Mixtures of many commonly encountered density functions (e.g., Gaussians) are usually identifiable.
• Discrete distributions, especially when there are many components in the mixture, often result in more unknowns than there are independent equations, making identifiability impossible unless additional information is available.
• While it can be shown that mixtures of normal densities are usually identifiable, there are scenarios where this is not the case.

Identifying mixture distributions

• While it can be shown that mixtures of normal densities are usually identifiable, there are scenarios where this is not the case:

P(x \mid θ) = \frac{P(ω_1)}{\sqrt{2π}} \exp\left[ -\frac{1}{2}(x - θ_1)^2 \right] + \frac{P(ω_2)}{\sqrt{2π}} \exp\left[ -\frac{1}{2}(x - θ_2)^2 \right]

• This cannot be uniquely identified if P(ω_1) = P(ω_2), because θ = (θ_1, θ_2) and θ = (θ_2, θ_1) can be interchanged without affecting P(x | θ); we cannot recover a unique θ even from an infinite amount of data!
• We focus on those cases in which the mixture distributions are identifiable

Learning Mixtures from Data

• Consider a fixed K
• Unknown parameters, e.g. Θ = {μ_1, σ_1, μ_2, σ_2, α_1, α_2}
• Given data D = {x_1, …, x_N}, we want to find the parameters Θ that “best fit” the data

Maximum Likelihood Principle

• Assume a probabilistic model
• Likelihood = p(data | parameters, model)
• Find the parameters that make the data most likely

L(Θ) = p(D \mid Θ) = \prod_{i=1}^{N} p(x_i \mid Θ)

which in the case of a mixture model reduces to

L(Θ) = \prod_{i=1}^{N} \left[ \sum_{k=1}^{K} p(x_i \mid c_k, θ_k)\, α_k \right]

The EM Algorithm

• Dempster, Laird, and Rubin (1977)
• A general framework for likelihood-based parameter estimation with missing data
  – start with initial guesses of the parameters
  – E-step: estimate memberships given the parameters
  – M-step: estimate the parameters given the memberships
  – repeat until convergence
• Converges to a (local) maximum of the likelihood
• The E-step and M-step are often computationally simple
• Generalizes to maximum a posteriori (with priors)

The EM Algorithm for Learning Components of a Mixture Distribution

• Similar to the application of EM for handling missing attribute values in Bayesian networks.
• It is the class label that is missing in the entire data set!

EM Method for Mixture Estimation

• Suppose that we have a set D = {x_1, …, x_N} of N unlabeled samples drawn independently from the mixture density

p(x \mid Θ) = \sum_{k=1}^{K} p(x \mid ω_k, θ_k)\, α_k, \qquad where Θ = \{θ_1, …, θ_K, α_1, …, α_K\}

L(Θ) = p(D \mid Θ) = \prod_{i=1}^{N} p(x_i \mid Θ) = \prod_{i=1}^{N} \left[ \sum_{k=1}^{K} p(x_i \mid ω_k, θ_k)\, α_k \right]

EM Method for Mixture Estimation

The maximum likelihood estimate is

\hat{Θ} = \arg\max_Θ p(D \mid Θ), \qquad with\ p(D \mid Θ) = \prod_{i=1}^{N} p(x_i \mid Θ)

The log likelihood is

l = \sum_{i=1}^{N} \ln p(x_i \mid Θ) = \sum_{i=1}^{N} \ln \left( \sum_{k=1}^{K} α_k\, p(x_i \mid ω_k, θ_k) \right)

EM Method for Mixture Estimation

Because the ω_k are unknown, we model them as a set of hidden random variables and take the expectation over the possible values of Ω. Unfortunately, estimating the distribution of Ω requires knowledge of Θ. To break this cycle, we start with a guess \hat{Θ} for Θ:

E_Ω[l] = \sum_{i=1}^{N} \sum_{k=1}^{K} P(ω_k \mid x_i, \hat{Θ}) \ln\left( α_k\, p(x_i \mid ω_k, θ_k) \right)

EM Method for Mixture Estimation

• With a bit of algebra, this expression can be simplified to yield

E_Ω[l] = \sum_{k=1}^{K} \sum_{i=1}^{N} P(ω_k \mid x_i, \hat{Θ}) \ln\left( α_k\, p(x_i \mid ω_k, θ_k) \right)

• We pick the next guess for Θ so as to maximize the above expectation, subject to the constraint \sum_{k=1}^{K} α_k = 1, using the standard Lagrange-multiplier approach

Maximum likelihood mixture identification

Update equations for Θ:

\hat{α}_k = \frac{1}{N} \sum_{i=1}^{N} P(ω_k \mid x_i, \hat{Θ}), \qquad where\quad P(ω_k \mid x_i, \hat{Θ}) = \frac{p(x_i \mid ω_k, \hat{θ}_k)\, \hat{α}_k}{\sum_{j=1}^{K} p(x_i \mid ω_j, \hat{θ}_j)\, \hat{α}_j}

\hat{θ}_k = \frac{\sum_{i=1}^{N} x_i\, P(ω_k \mid x_i, \hat{Θ})}{\sum_{i=1}^{N} P(ω_k \mid x_i, \hat{Θ})}

Example: Mixtures of Normal Densities

• p(x | ω_i, θ_i) ~ N(μ_i, Σ_i)
• Possible cases (? = unknown, × = known):

Case   μ_i   Σ_i   P(ω_i)   c
 1      ?     ×      ×      ×
 2      ?     ?      ?      ×
 3      ?     ?      ?      ?


Case 1 – Unknown mean vectors

μ_i = θ_i, ∀ i = 1, …, c

\ln p(x \mid ω_i, μ_i) = -\ln\left[ (2π)^{d/2} |Σ_i|^{1/2} \right] - \frac{1}{2} (x - μ_i)^t Σ_i^{-1} (x - μ_i)

\hat{μ}_i = \frac{\sum_{k=1}^{n} P(ω_i \mid x_k, \hat{μ})\, x_k}{\sum_{k=1}^{n} P(ω_i \mid x_k, \hat{μ})} \qquad (1)

P(ω_i | x_k, \hat{μ}) is the fraction of those samples having value x_k that come from the i-th class, and \hat{μ}_i is the average of the samples coming from the i-th class.


Maximum likelihood mixture identification

• Unfortunately, equation (1) does not give \hat{μ}_i explicitly
• However, if we have some way of obtaining good initial estimates \hat{μ}_i(0) for the unknown means, equation (1) provides a way to apply the EM algorithm:

\hat{μ}_i(j+1) = \frac{\sum_{k=1}^{n} P(ω_i \mid x_k, \hat{μ}(j))\, x_k}{\sum_{k=1}^{n} P(ω_i \mid x_k, \hat{μ}(j))}

Gradient-based ML identification of mixtures

• Example: consider the simple two-component one-dimensional normal mixture

p(x \mid μ_1, μ_2) = \frac{1}{3\sqrt{2π}} \exp\left[ -\frac{1}{2}(x - μ_1)^2 \right] + \frac{2}{3\sqrt{2π}} \exp\left[ -\frac{1}{2}(x - μ_2)^2 \right]

(2 clusters!)

• Set μ_1 = −2, μ_2 = 2 and draw 25 samples sequentially from this mixture. The log-likelihood function is

l(μ_1, μ_2) = \sum_{k=1}^{n} \ln p(x_k \mid μ_1, μ_2)

Gradient-based ML identification of mixtures

• The maximum value of l occurs at \hat{μ}_1 = −2.130 and \hat{μ}_2 = 1.668,
• which are not far from the true values μ_1 = −2 and μ_2 = +2


Identifying mixtures of normals when all parameters unknown

If no constraints are placed on the covariance matrix:
• The ML principle results in useless singular solutions, because it is possible to make the likelihood arbitrarily large.
• In practice, we get useful results by focusing on the largest of the finite local maxima of the likelihood function or by applying the minimum description length principle.


Maximum Likelihood estimation of mixtures of normals – the general case

\hat{P}(ω_i) = \frac{1}{n} \sum_{k=1}^{n} \hat{P}(ω_i \mid x_k, \hat{θ})

\hat{μ}_i = \frac{\sum_{k=1}^{n} \hat{P}(ω_i \mid x_k, \hat{θ})\, x_k}{\sum_{k=1}^{n} \hat{P}(ω_i \mid x_k, \hat{θ})}

\hat{Σ}_i = \frac{\sum_{k=1}^{n} \hat{P}(ω_i \mid x_k, \hat{θ}) (x_k - \hat{μ}_i)(x_k - \hat{μ}_i)^t}{\sum_{k=1}^{n} \hat{P}(ω_i \mid x_k, \hat{θ})}

\hat{P}(ω_i \mid x_k, \hat{θ}) = \frac{|\hat{Σ}_i|^{-1/2} \exp\left[ -\frac{1}{2}(x_k - \hat{μ}_i)^t \hat{Σ}_i^{-1} (x_k - \hat{μ}_i) \right] \hat{P}(ω_i)}{\sum_{j=1}^{c} |\hat{Σ}_j|^{-1/2} \exp\left[ -\frac{1}{2}(x_k - \hat{μ}_j)^t \hat{Σ}_j^{-1} (x_k - \hat{μ}_j) \right] \hat{P}(ω_j)}

Markov Models, Hidden Markov Models

Outline
• Bag of words, n-grams, and related models
• Markov models
• Hidden Markov models
• Higher-order Markov models
• Variations on hidden Markov models
• Applications

Applications of Sequence Classifiers

• Speech recognition
• Natural language processing
• Text processing
• Gesture recognition
• Biological sequence analysis
  – gene identification
  – protein classification

Bag of words, n-grams and related models

• Map arbitrary-length sequences to fixed-length feature representations
• Bag of words: represent sequences by feature vectors with as many components as there are words in the vocabulary
• n-grams: short subsequences of n letters
• Both ignore the relative ordering of words or n-grams along the sequence
  – “cat chased the mouse” and “mouse chased the cat” have identical bag-of-words representations

Bag of words, n-grams and related models

Fixed-length feature representations make it possible to apply methods that work with feature-based representations.

Features can be
• given (as in the case of the words of the English vocabulary)
• discovered from data: statistics of occurrence of n-grams in the data
  – if variable-length n-grams are allowed, we need to take into account possible overlaps
  – computation of n-gram frequencies can be made efficient using dynamic programming
  – if a string appears k times in a piece of text, any substring of the string appears at least k times in the text

Markov models (Markov Chains)

A Markov model is a probabilistic model of symbol sequences in which the probability of the current event depends only on the immediately preceding event.

Consider a sequence of random variables X1, X2, …, XN. Think of the subscripts as indicating word position in a sentence or letter position in a sequence. Recall that a random variable is a function.
• In the case of sentences made of words, the range of the random variables is the vocabulary of the language.
• In the case of DNA sequences, the random variables take on values from the 4-letter alphabet {A, C, G, T}.


Simple Model - Markov Chains

Markov Property: The state of the system at time t+1 only depends on the state of the system at time t

P[X_{t+1} = x_{t+1} \mid X_t = x_t, X_{t-1} = x_{t-1}, …, X_1 = x_1, X_0 = x_0] = P[X_{t+1} = x_{t+1} \mid X_t = x_t]

[Figure: chain X1 → X2 → X3 → X4 → X5.]

Markov chains

The fact that the subscript “1” appears on both the X and the x in “X1 = x1” is a bit of an abuse of notation. It might be better to write

P(X_1 = x_{s_1}, X_2 = x_{s_2}, …, X_t = x_{s_t})

where ∀j, x_{s_j} ∈ {v_1, …, v_L} = Range(X_j)

In what follows, we will abuse notation.


Markov Chains

Stationarity: Probabilities are independent of t

P[X_{t+1} = x_j \mid X_t = x_i] = a_{ij}

This means that if the system is in state i, the probability that the system will transition to state j is a_{ij}, regardless of the value of t.


Describing a Markov Chain

A Markov chain can be described by the transition matrix A and initial state probabilities Q:

a_{ij} = P(X_{t+1} = j \mid X_t = i)

q_i = P(X_1 = i)

P(X_1, …, X_T) = P(X_1)\, P(X_2 \mid X_1) \cdots P(X_T \mid X_{T-1}) = q_{X_1} \prod_{t=1}^{T-1} A(X_t, X_{t+1})

Two ways to represent the conditional probability table of a first-order Markov process

As a table (columns: current symbol; rows: next symbol) or as a state-transition diagram:

              Current symbol
               A     B     C
Next     A    .7    .3    0
symbol   B    .2    .7    .5
         C    .1    0     .5

[Figure: the same chain drawn as a state-transition diagram over A, B, C.]

Sample string: CCBBAAAAABAABACBABAAA

The probability of generating a string

Product of probabilities, one for each term in the sequence

p(\{X_t\}_{1}^{T}) = p(X_1) \prod_{t=2}^{T} p(X_t \mid X_{t-1})

Here {X_t}_1^T is the sequence of symbols from time 1 to time T; p(X_1) comes from the table of initial probabilities, and each p(X_t | X_{t-1}) is a transition probability.


The fundamental questions

Likelihood: given a model μ = (A, Q), how can we efficiently compute the likelihood of an observation, P(X | μ)?

For any state sequence (X_1, …, X_T):

P(X_1, …, X_T) = q_{x_1}\, a_{x_1 x_2}\, a_{x_2 x_3} \cdots a_{x_{T-1} x_T}

Learning: given a set of observation sequences X and a generic model, how can we estimate the parameters that define the best model to describe the data? Use the standard estimation methods (maximum likelihood or Bayesian estimates) discussed earlier in the course.


Simple Example of a Markov model

Weather

• raining today → rain tomorrow:        a_rr = 0.4
• raining today → no rain tomorrow:     a_rn = 0.6
• not raining today → rain tomorrow:    a_nr = 0.2
• not raining today → no rain tomorrow: a_nn = 0.8

Simple Example of a Markov model

A = \begin{pmatrix} 0.4 & 0.6 \\ 0.2 & 0.8 \end{pmatrix}, \qquad Q = \begin{pmatrix} 0.3 \\ 0.7 \end{pmatrix}

Note that
• both the transition matrix and the initial state matrix are stochastic matrices (rows sum to 1)
• in general, the transition probabilities between two states need not be symmetric (a_{ij} ≠ a_{ji}), and the probability of transition from a state to itself (a_{ii}) need not be zero


Types of Markov models – Ergodic models

Ergodic model: strongly connected; there is a directed path with positive probabilities from each state i to each state j (but not necessarily a complete directed graph). That is, each state j can be reached from each state i with positive probability.

Types of Models – LR models

Left-to-Right (LR) model: the index of the state is non-decreasing with time


Markov models with absorbing states

At each play
• the gambler wins $1 with probability p, or
• the gambler loses $1 with probability 1 − p
The game ends when the gambler goes broke or gains a fortune of $100; both $0 and $100 are absorbing states.

[Figure: chain of states 0, 1, 2, …, N−1, N with rightward transitions of probability p and leftward transitions of probability 1 − p; start at $10.]

Coke vs. Pepsi

Given that a person’s last cola purchase was Coke, there is a 90% chance that her next cola purchase will also be Coke. If a person’s last cola purchase was Pepsi, there is an 80% chance that her next cola purchase will also be Pepsi.

[Figure: two-state chain with P(Coke → Coke) = 0.9, P(Coke → Pepsi) = 0.1, P(Pepsi → Pepsi) = 0.8, P(Pepsi → Coke) = 0.2.]

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Coke vs. Pepsi

Given that a person is currently a Pepsi purchaser, what is the probability that she will purchase Coke two purchases from now?

The transition matrix is:

A = | 0.9  0.1 |     (corresponding to one purchase ahead)
    | 0.2  0.8 |

A^2 = | 0.9  0.1 | | 0.9  0.1 |  =  | 0.83  0.17 |
      | 0.2  0.8 | | 0.2  0.8 |     | 0.34  0.66 |

Copyright Vasant Honavar, 2006.

65 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Coke vs. Pepsi

Given that a person is currently a Coke drinker, what is the probability that she will purchase Pepsi three purchases from now?

A^3 = A · A^2 = | 0.9  0.1 | | 0.83  0.17 |  =  | 0.781  0.219 |
                | 0.2  0.8 | | 0.34  0.66 |     | 0.438  0.562 |

so the answer is 0.219.

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Coke vs. Pepsi Assume each person makes one cola purchase per week. Suppose 60% of all people now drink Coke, and 40% drink Pepsi. What fraction of people will be drinking Coke three weeks from now?

Let (q_0, q_1) = (0.6, 0.4) be the initial probabilities, denoting Coke by 0 and Pepsi by 1, and let

A = | 0.9  0.1 |
    | 0.2  0.8 |

We want P(X_3 = 0):

P(X_3 = 0) = Σ_{i=0}^{1} q_i a^(3)_{i0} = q_0 a^(3)_{00} + q_1 a^(3)_{10} = (0.6)(0.781) + (0.4)(0.438) = 0.6438
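The same computation in a few lines of NumPy (an illustrative sketch, not part of the slides):

```python
import numpy as np

A = np.array([[0.9, 0.1],      # row 0 = Coke, row 1 = Pepsi
              [0.2, 0.8]])
q = np.array([0.6, 0.4])       # current fractions of Coke and Pepsi drinkers

A3 = np.linalg.matrix_power(A, 3)
print(A3)          # [[0.781 0.219], [0.438 0.562]]
print(q @ A3)      # [0.6438 0.3562]: 64.38% will be drinking Coke in three weeks
```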

Copyright Vasant Honavar, 2006.

66 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Learning the conditional probability table

Naïve: observe a lot of strings and set the conditional probabilities equal to the observed relative frequencies:

p(B | A) = (Σ_strings #occurrences of AB) / (Σ_strings #occurrences of A)

Better: add 1 to the numerator and the number of symbols to the denominator – a weak uniform prior over the transition probabilities:

p(B | A) = (1 + Σ_strings #AB) / (N_symbols + Σ_strings #A)
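A sketch of both estimators in Python (illustrative, not from the slides); the function and argument names are made up for this example.

```python
from collections import Counter

def estimate_transitions(strings, alphabet="ABC", add_one=True):
    """Estimate p(next | current) from observed strings.

    add_one=True applies the 'add 1 to the top, N_symbols to the bottom' rule."""
    pair_counts, single_counts = Counter(), Counter()
    for s in strings:
        for a, b in zip(s, s[1:]):
            pair_counts[(a, b)] += 1
            single_counts[a] += 1
    N = len(alphabet)
    table = {}
    for a in alphabet:
        for b in alphabet:
            if add_one:
                table[(a, b)] = (1 + pair_counts[(a, b)]) / (N + single_counts[a])
            else:
                table[(a, b)] = pair_counts[(a, b)] / max(single_counts[a], 1)
    return table

probs = estimate_transitions(["CCBBAAAAABAABACBABAAA"])
print(probs[("A", "B")])
```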

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Hidden Markov Models

In many scenarios states cannot be directly observed. We need an extension -- Hidden Markov Models

[Figure: an HMM with four hidden states; self-transitions a_11, a_22, a_33, a_44, forward transitions a_12, a_23, a_34, and output probabilities b_11, b_12, b_13, b_14, etc., to the observation symbols, with b_11 + b_12 + b_13 + b_14 = 1, b_21 + b_22 + b_23 + b_24 = 1, and so on.]

a_ij are state transition probabilities.

bik are observation (output) probabilities.

Copyright Vasant Honavar, 2006.

67 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Hidden Markov Models

We introduce hidden states to get a hidden Markov model: – The next hidden state depends only on the current hidden state, but hidden states can carry along information from more than one time-step in the past. – The current symbol depends only on the current hidden state.

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Example: Dishonest Casino

What is hidden in this model? The state sequence: you are allowed to see the outcome of each die roll, but you do not know which outcomes were obtained by a fair die and which were obtained by a loaded die.

Copyright Vasant Honavar, 2006.

68 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

What is an HMM?

• Green circles are hidden states • Each hidden state is dependent only on the previous state: Markov process • “The past is independent of the future given the present.”

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

What is an HMM?

• Purple nodes are observed states • Each observed state is dependent only on the corresponding hidden state

Copyright Vasant Honavar, 2006.

69 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Specifying HMM

[Graphical model: hidden states X_1, X_2, ..., X_t linked by the transition matrix A, each emitting an observation O_1, O_2, ..., O_t through the observation matrix B.]

• An HMM is specified by {X, O, Π, A, B} (a code sketch follows this list)

• Π = {π_i} are the initial state probabilities

• A = {aij} are the state transition probabilities

• B = {bik} are the observation state probabilities
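A minimal sketch of this specification as a data structure (Python/NumPy, assumed; the two-state, two-symbol numbers below are placeholders, not from the slides):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:
    pi: np.ndarray   # Pi = {pi_i}: initial state probabilities, shape (N,)
    A: np.ndarray    # A = {a_ij}: state transition probabilities, shape (N, N), rows sum to 1
    B: np.ndarray    # B = {b_ik}: observation probabilities, shape (N, M), rows sum to 1

# Hypothetical two-state, two-symbol model used purely for illustration.
hmm = HMM(pi=np.array([0.6, 0.4]),
          A=np.array([[0.7, 0.3], [0.4, 0.6]]),
          B=np.array([[0.1, 0.9], [0.8, 0.2]]))
```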

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory A hidden Markov model

[Figure: a hidden Markov model with hidden states i, j, k; the arcs between hidden states carry transition probabilities (.7, .3, .2, ...), and each hidden state carries a vector of output probabilities over the symbols A, B, C.]

Each hidden node has a vector of transition probabilities and a vector of output probabilities.

Copyright Vasant Honavar, 2006.

70 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Coin-Tossing Example

[Figure: a two-state HMM with hidden states Fair and Loaded. The fair coin emits head and tail with probability 1/2 each; the loaded coin emits head with probability 3/4 and tail with probability 1/4. Each state remains in itself with probability 0.9 and switches with probability 0.1; the two start probabilities are 1/2 each.]

[Unrolled model: hidden Fair/Loaded states X_1, X_2, ..., X_{L-1}, X_L over L tosses, each emitting a Head/Tail observation O_1, O_2, ..., O_{L-1}, O_L.]

Query: what are the most likely values in the X-nodes to generate the given data?

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Fundamental problems

• Likelihood – Compute the probability of a given observation sequence given a model • Decoding – Given an observation sequence, and a model, compute the most likely hidden state sequence • Learning – Given an observation sequence and set of possible models, which model most closely fits the data?

Copyright Vasant Honavar, 2006.

71 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Generating a string from an HMM

It is easy to generate strings if we know the parameters of the model. At each time step, make two random choices: • Use the transition probabilities from the current hidden node to pick the next hidden node. • Use the output probabilities from the current hidden node to pick the current symbol to output.
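A sketch of this two-step sampling loop, using the fair/loaded coin numbers from the earlier figure (the equal start probabilities are an assumption, and the code itself is illustrative rather than part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["Fair", "Loaded"]
obs_symbols = ["head", "tail"]
pi = np.array([0.5, 0.5])                  # assumed start probabilities
A = np.array([[0.9, 0.1], [0.1, 0.9]])     # stay with probability 0.9, switch with 0.1
B = np.array([[0.5, 0.5], [0.75, 0.25]])   # P(head), P(tail) for Fair and Loaded

def generate(T):
    """Emit a symbol from the current hidden state, then move to the next hidden state."""
    x = rng.choice(2, p=pi)
    xs, os = [], []
    for _ in range(T):
        xs.append(x)
        os.append(rng.choice(2, p=B[x]))
        x = rng.choice(2, p=A[x])
    return [states[i] for i in xs], [obs_symbols[i] for i in os]

hidden, visible = generate(10)
print(hidden)
print(visible)
```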

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Generating a string from an HMM

It is easy to generate strings if we know the parameters of the model: first produce a complete hidden sequence, then let each hidden node in the sequence produce one symbol.

• Hidden nodes only depend on previous hidden nodes

• The probability of generating a hidden sequence does not depend on the visible sequence that it generates.

Copyright Vasant Honavar, 2006.

72 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory The probability of generating a hidden sequence

Product of probabilities, one for each term in the sequence:

p({X_t}_1^T) = p(X_1) ∏_{t=2}^{T} p(X_t | X_{t-1})

Here {X_t}_1^T means a sequence of hidden nodes from time 1 to time T; p(X_1) comes from the table of initial probabilities of hidden nodes, and each factor is a transition probability between hidden nodes: a_ij = p(X_t = j | X_{t-1} = i).

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

The joint probability of generating a hidden sequence and a visible sequence

p({X_t, O_t}_1^T) = p(X_1) p(O_1 | X_1) ∏_{t=2}^{T} p(X_t | X_{t-1}) p(O_t | X_t)

The left-hand side is the probability of a sequence of hidden states and output symbols; each p(O_t | X_t) is the probability of outputting symbol O_t from state X_t.

Copyright Vasant Honavar, 2006.

73 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory The probability of generating a visible sequence from an HMM

p({O_t}_1^T) = Σ_{X ∈ paths through hidden states} p({O_t}_1^T | X) p(X)

The same visible sequence can be produced by many different hidden sequences.
– There are exponentially many possible hidden sequences.
– How can we calculate p({O_t}_1^T)?
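To make the blow-up concrete, here is a brute-force sketch (Python, with placeholder two-state parameters, not from the slides) that literally sums p(O | X) p(X) over all N^T hidden paths; the forward procedure introduced below computes the same number in O(N^2 T).

```python
import itertools
import numpy as np

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
obs = [0, 1, 1, 0]                     # observation sequence of symbol indices

def brute_force_likelihood(obs):
    """Sum p(O | X) p(X) over every possible hidden path X (N**T terms)."""
    N, T = len(pi), len(obs)
    total = 0.0
    for path in itertools.product(range(N), repeat=T):
        p = pi[path[0]] * B[path[0], obs[0]]
        for t in range(1, T):
            p *= A[path[t-1], path[t]] * B[path[t], obs[t]]
        total += p
    return total

print(brute_force_likelihood(obs))
```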

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Fundamental problems

• Likelihood – Compute the probability of a given observation sequence given a model • Decoding – Given an observation sequence, and a model, compute the most likely hidden state sequence • Learning – Given an observation sequence and set of possible models, which model most closely fits the data?

Copyright Vasant Honavar, 2006.

74 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

The HMM dynamic programming trick

[Trellis figure: hidden states i, j, k replicated at times τ − 1 and τ.]

Dynamic programming offers an efficient way to compute a sum that has exponentially many terms. At each time τ we combine everything we need to know about the paths up to that time:

λ_{iτ} = p({O_t}_1^τ | X_τ = i)

is the probability of having produced the sequence up to time τ given that state i is used at time τ. This quantity can be computed recursively:

λ_{iτ} = p(O_τ | X_τ = i) Σ_j λ_{j,τ−1} p(X_τ = i | X_{τ−1} = j)

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Probability of an Observation Sequence

o1 ot-1 ot ot+1 oT

Given an observation sequence and a model, compute the probability of the observation sequence

O = (o1,...,oT ), μ = (A,B,Π) Compute P(O | μ)

Copyright Vasant Honavar, 2006.

75 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Probability of an observation sequence

x1 xt-1 xt xt+1 xT

o1 ot-1 ot ot+1 oT

P(O | X, μ) = b_{x_1 o_1} b_{x_2 o_2} ... b_{x_T o_T}

P(X | μ) = π_{x_1} a_{x_1 x_2} a_{x_2 x_3} ... a_{x_{T-1} x_T}

P(O, X | μ) = P(O | X, μ) P(X | μ)

P(O | μ) = Σ_X P(O | X, μ) P(X | μ)

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Probability of an Observation Sequence

x1 xt-1 xt xt+1 xT

o1 ot-1 ot ot+1 oT

P(O | μ) = Σ_{X_1...X_T} π_{x_1} b_{x_1 o_1} ∏_{t=1}^{T−1} a_{x_t x_{t+1}} b_{x_{t+1} o_{t+1}}

Copyright Vasant Honavar, 2006.

76 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Probability of an observation sequence

x1 xt-1 xt xt+1 xT

o1 ot-1 ot ot+1 oT

• Special structure gives us an efficient solution using dynamic programming.
• Intuition – the probability of the first t observations is the same for all possible t + 1 length state sequences.
• Define: α_i(t) = P(o_1 ... o_t, x_t = i | μ)

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Forward Procedure

x1 xt-1 xt xt+1 xT

o1 ot-1 ot ot+1 oT

α_j(t+1) = P(o_1 ... o_{t+1}, x_{t+1} = j)

= P(o_1 ... o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)

= P(o_1 ... o_t | x_{t+1} = j) P(o_{t+1} | x_{t+1} = j) P(x_{t+1} = j)

= P(o_1 ... o_t, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)

Copyright Vasant Honavar, 2006.

77 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Forward Procedure

x1 xt-1 xt xt+1 xT

o1 ot-1 ot ot+1 oT

α_j(t+1) = Σ_{i=1}^{N} P(o_1 ... o_t, x_t = i, x_{t+1} = j) P(o_{t+1} | x_{t+1} = j)

= Σ_{i=1}^{N} P(o_1 ... o_t, x_{t+1} = j | x_t = i) P(x_t = i) P(o_{t+1} | x_{t+1} = j)

= Σ_{i=1}^{N} P(o_1 ... o_t, x_t = i) P(x_{t+1} = j | x_t = i) P(o_{t+1} | x_{t+1} = j)

= Σ_{i=1}^{N} α_i(t) a_{ij} b_{j o_{t+1}}
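A sketch of the forward procedure as this recursion reads (Python/NumPy, with placeholder parameters; not part of the slides):

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[t, i] = P(o_1..o_t, x_t = i), built up left to right."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]   # sum_i alpha_i(t-1) a_ij, times b_j
    return alpha

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
obs = [0, 1, 1, 0]
alpha = forward(pi, A, B, obs)
print(alpha[-1].sum())    # P(O | mu) via the forward procedure
```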

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Backward Procedure

x1 xt-1 xt xt+1 xT

o1 ot-1 ot ot+1 oT

β_i(t) = P(o_{t+1} ... o_T | x_t = i)  – the probability of the rest of the observations given the state at time t

β_i(T) = 1

β_i(t) = Σ_{j=1}^{N} a_{ij} b_{j o_{t+1}} β_j(t+1)

Copyright Vasant Honavar, 2006.

78 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Sequence probability

x1 xt-1 xt xt+1 xT

o1 ot-1 ot ot+1 oT

Forward procedure:    P(O | μ) = Σ_{i=1}^{N} α_i(T)

Backward procedure:   P(O | μ) = Σ_{i=1}^{N} π_i b_{i o_1} β_i(1)

Combination (any t):  P(O | μ) = Σ_{i=1}^{N} α_i(t) β_i(t)

Copyright Vasant Honavar, 2006.
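A companion sketch of the backward recursion together with a check that all three expressions above give the same likelihood (Python/NumPy, placeholder parameters, not from the slides; the base case is taken as β_i(T) = 1):

```python
import numpy as np

def forward(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    """beta[t, i] = P(o_{t+1}..o_T | x_t = i), built up right to left."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])   # sum_j a_ij b_{j,o_{t+1}} beta_j(t+1)
    return beta

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
obs = [0, 1, 1, 0]
alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
print(alpha[-1].sum())                       # forward
print((pi * B[:, obs[0]] * beta[0]).sum())   # backward
print((alpha[1] * beta[1]).sum())            # combination at an arbitrary t
```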

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Fundamental problems

• Likelihood – Compute the probability of a given observation sequence given a model • Decoding – Given an observation sequence, and a model, compute the most likely hidden state sequence • Learning – Given an observation sequence and set of possible models, which model most closely fits the data?

Copyright Vasant Honavar, 2006.

79 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory The most probable State Sequence

o1 ot-1 ot ot+1 oT

Find the state sequence that best explains the observations:

argmax_X P(X | O)

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Viterbi Algorithm

x1 xt-1 j

ot-1 ot ot+1 oT o1

δ_j(t) = max_{x_1 ... x_{t−1}} P(x_1 ... x_{t−1}, o_1 ... o_{t−1}, x_t = j, o_t)

The probability of the state sequence which maximizes the probability of seeing the observations to time t − 1, landing in state j, and seeing the observation at time t.

Copyright Vasant Honavar, 2006.

80 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Viterbi Algorithm

x1 xt-1 xt xt+1

o1 ot-1 ot ot+1 oT

δ_j(t) = max_{x_1 ... x_{t−1}} P(x_1 ... x_{t−1}, o_1 ... o_{t−1}, x_t = j, o_t)

Recursive computation:

δ_j(t+1) = max_i δ_i(t) a_{ij} b_{j o_{t+1}}

ψ_j(t+1) = argmax_i δ_i(t) a_{ij} b_{j o_{t+1}}

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Viterbi Algorithm

x1 xt-1 xt xt+1 xT

o1 ot-1 ot ot+1 oT

Compute the most likely state sequence by working backwards:

X̂_T = argmax_i δ_i(T)

X̂_t = ψ_{X̂_{t+1}}(t+1)
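Putting the three Viterbi slides together, a sketch of the full algorithm with backpointers (Python/NumPy, not from the slides; the fair/loaded coin numbers reuse the earlier example, and the observation sequence is made up):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """delta[t, j] = best score of a path ending in state j at time t; psi stores backpointers."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t-1][:, None] * A        # scores[i, j] = delta_i(t-1) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Backtrack: X_T = argmax_i delta_i(T), then X_t = psi_{X_{t+1}}(t+1).
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
B = np.array([[0.5, 0.5], [0.75, 0.25]])    # fair/loaded coin: P(head), P(tail)
obs = [0, 0, 0, 1, 0, 0, 1, 1]              # 0 = head, 1 = tail
print(viterbi(pi, A, B, obs))               # most likely fair (0) / loaded (1) labels
```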

Copyright Vasant Honavar, 2006.

81 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Fundamental problems

• Likelihood – Compute the probability of a given observation sequence given a model • Decoding – Given an observation sequence, and a model, compute the most likely hidden state sequence • Learning – Given an observation sequence and set of possible models, which model most closely fits the data?

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Learning the parameters of an HMM

• It's easy to learn the parameters if, for each observed sequence of symbols, we can infer the posterior distribution across the sequences of hidden states.
• We can infer which hidden state sequence gave rise to an observed sequence by using the dynamic programming trick.

Copyright Vasant Honavar, 2006.

82 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Learning HMM – Parameter Estimation

[Unrolled HMM: hidden states linked by the transition matrix A, each emitting an observation o_1, ..., o_{t−1}, o_t, o_{t+1}, ..., o_T through the observation matrix B.]

• Given an observation sequence, find the model that is most likely to produce that sequence. • Given a model and observation sequence, update the model parameters to better fit the observations.

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory The probability of generating a visible sequence from an HMM

p(O) = Σ_{X ∈ hidden paths} p(O | X) p(X)

The same visible sequence can be produced by many different hidden sequences


Copyright Vasant Honavar, 2006.

83 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

The posterior probability of a hidden path X given a visible sequence O:

p(X | O) = p(X) p(O | X) / Σ_{Y ∈ hidden paths} p(Y) p(O | Y)

The sum in the denominator could be computed efficiently using the dynamic programming trick. But for learning we do not need to know about entire hidden paths.

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Learning the parameters of an HMM

• It's easy to learn the parameters if, for each observed sequence of symbols, we can infer the posterior probability for each hidden node at each time step.
• We can infer these posterior probabilities by using the dynamic programming trick.

Copyright Vasant Honavar, 2006.

84 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory The HMM dynamic programming trick

[Trellis figure: hidden states i, j, k at times t − 1 and t.]

α_i(t) = p(O_1 ... O_t, x_t = i)

α_i(t) = Σ_{j ∈ hidden states} α_j(t−1) a_{ji} b_{i,o_t}

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

The dynamic programming trick again

[Trellis figure: hidden states i, j, k at times t − 1 and t.]

β_i(t) = p(O_{t+1} ... O_T | x_t = i)

β_i(t) = Σ_{j ∈ hidden states} a_{ij} b_{j,o_{t+1}} β_j(t+1)

Copyright Vasant Honavar, 2006.

85 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory The forward-backward algorithm (Baum-Welch algorithm)

• We do a forward pass along the observed string to compute the alphas at each time step for each node.
• We do a backward pass along the observed string to compute the betas at each time step for each node.
• Once we have the alphas and betas at each time step, it's easy to re-estimate the output probabilities and transition probabilities.

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Learning the parameters of the HMM • To learn the transition matrix – we need to know the expected number of times that each transition between two hidden nodes was used when generating the observed sequence.

• To learn the output probabilities – we need to know the expected number of times each node was used to generate each symbol.

Because the states are hidden, we use the expectation maximization (EM) algorithm.

Copyright Vasant Honavar, 2006.

86 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory The re-estimation equations (the M-step of the EM procedure)

For the transition probability from node i to node j:

a_ij^new = Count(i → j transitions in the data) / Σ_{k ∈ hidden states} Count(i → k transitions in the data)

For the probability that node i generates symbol A:

b_i(A) = Count(state i produces symbol A in the data) / Σ_{B ∈ symbols} Count(state i produces symbol B in the data)

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Summing the expectations over time

The expected number of times that node i produces symbol A requires a summation over all the different times in the sequence when there was an A.

Count(state i produces symbol A) = Σ_{t : O_t = A} p(X_t = i | O)

The expected number of times that the transition from i to j occurred requires a summation over all pairs of adjacent times in the sequence:

Count(transitions from state i to state j) = Σ_{t=1}^{T−1} p(x_t = i, x_{t+1} = j | O)

Copyright Vasant Honavar, 2006.

87 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Combine the past and the future to get the full posterior
• To re-estimate the output probabilities, we need to compute the posterior probability of being at a particular hidden node at a particular time.
• This requires a summation of the posterior probabilities of all the paths that go through that node at that time.

[Trellis figure: all paths passing through hidden node i at time t.]

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Combining past and future

p(O, x_t = i) = α_i(t) β_i(t)

p(x_t = i | O) = p(O, x_t = i) / p(O) = α_i(t) β_i(t) / p(O)

p(x_t = i, x_{t+1} = j | O) = α_i(t) a_{ij} b_{j,O_{t+1}} β_j(t+1) / p(O)

Copyright Vasant Honavar, 2006.

88 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Parameter Estimation: Baum-Welch or Forward-Backward

[Unrolled HMM: hidden states linked by the transition matrix A, observations o_1, ..., o_{t−1}, o_t, o_{t+1}, ..., o_T emitted through B.]

Probability of traversing an arc:

p_t(i, j) = α_i(t) a_{ij} b_{j o_{t+1}} β_j(t+1) / Σ_{m=1}^{N} α_m(t) β_m(t)

Probability of being in state i:

γ_i(t) = Σ_{j=1}^{N} p_t(i, j)

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Parameter Estimation: Baum-Welch Algorithm

[Unrolled HMM: hidden states linked by the transition matrix A, observations o_1, ..., o_{t−1}, o_t, o_{t+1}, ..., o_T emitted through B.]

Now we can compute the new estimates of the model parameters:

â_ij = Σ_{t=1}^{T} p_t(i, j) / Σ_{t=1}^{T} γ_i(t)

b̂_ik = Σ_{t : o_t = k} γ_i(t) / Σ_{t=1}^{T} γ_i(t)

π̂_i = γ_i(1)
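A sketch of one full Baum-Welch iteration that computes p_t(i, j) and γ_i(t) from the forward and backward passes and then applies these update equations (Python/NumPy, illustrative placeholder parameters; the transition update sums over t = 1..T−1, as in the standard formulation):

```python
import numpy as np

def forward(pi, A, B, obs):
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])
    return beta

def baum_welch_step(pi, A, B, obs):
    """One EM iteration: expected counts (E-step), then the re-estimation formulas (M-step)."""
    T, N, M = len(obs), len(pi), B.shape[1]
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood            # gamma[t, i] = p(x_t = i | O)
    xi = np.zeros((T - 1, N, N))                 # xi[t, i, j] = p(x_t = i, x_{t+1} = j | O)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t+1]] * beta[t+1])[None, :] / likelihood
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
obs = [0, 1, 1, 0, 0, 1]
print(baum_welch_step(pi, A, B, obs))
```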

Copyright Vasant Honavar, 2006.

89 Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory HMM Parameter estimation in practice

Sparseness of data requires
• smoothing of estimates using Laplace estimates (as in Naïve Bayes) to give suitable nonzero probability to unseen observations
• domain-specific tricks – feature decomposition (capitalized?, number?, etc. in text processing) gives a better estimate
• shrinkage, which allows pooling of estimates over multiple states of the same type
• a well-designed HMM topology

Copyright Vasant Honavar, 2006.

Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory

Copyright Vasant Honavar, 2006.

90