Slide 1: PATTERN RECOGNITION
Markov models
Vu PHAM phvu@fit.hcmus.edu.vn
Department of Computer Science March 28th, 2011
28/03/2011
Markov models
1
Slide 2: Contents
• Introduction
– Introduction – Motivation
• Markov Chain • Hidden Markov Models • Markov Random Field
28/03/2011 Markov models 2
Slide 3: Introduction
• Markov processes are first proposed by Russian mathematician Andrei Markov
– He used these processes to investigate Pushkin’s poem.
• Nowadays, Markov property and HMMs are widely used in many domains:
– Natural Language Processing – Speech Recognition – Bioinformatics – Image/video processing – ...
28/03/2011 Markov models 3
Slide 4: Motivation [0]
• As shown in his paper in 1906, Markov’s original motivation is purely mathematical:
– Application of The Weak Law of Large Number to dependent random variables.
• However, we shall not follow this motivation...
28/03/2011
Markov models
4
Slide 5: Motivation [1]
• From the viewpoint of classification:
– Context-free classification: Bayes classifier
p (ωi | x ) > p (ω j | x ) ∀j ≠ i
28/03/2011
Markov models
5
Slide 6: Motivation [1]
• From the viewpoint of classification:
– Context-free classification: Bayes classifier
p (ωi | x ) > p (ω j | x ) ∀j ≠ i
• Classes are independent. • Feature vectors are independent.
28/03/2011
Markov models
6
Slide 7: Motivation [1]
• From the viewpoint of classification:
– Context-free classification: Bayes classifier
p (ωi | x ) > p (ω j | x ) ∀j ≠ i
– However, there are some applications where various classes are closely realated:
• POS Tagging, Tracking, Gene boundary recover... s1
28/03/2011
s2
s3
Markov models
...
sm
...
7
Slide 8: Motivation [1]
• Context-dependent classification: s1 s2 s3 ... sm ...
– s1, s2, ..., sm: sequence of m feature vector – ω1, ω2,..., ωN: classes in which these vectors are classified: ωi = 1...k.
28/03/2011
Markov models
8
Slide 9: Motivation [1]
• Context-dependent classification: s1 s2 s3 ... sm ...
– s1, s2, ..., sm: sequence of m feature vector – ω1, ω2,..., ωN: classes in which these vectors are classified: ωi = 1...k.
• To apply Bayes classifier:
– X = s1s2...sm: extened feature vector – Ωi = ωi1, ωi2,..., ωiN : a classification Nm possible classifications
p ( Ωi | X ) > p ( Ω j | X ) ∀j ≠ i
p ( X | Ωi ) p ( Ωi ) > p ( X | Ω j ) p ( Ω j ) ∀j ≠ i
28/03/2011 Markov models 9
Slide 10: Motivation [1]
• Context-dependent classification: s1 s2 s3 ... sm ...
– s1, s2, ..., sm: sequence of m feature vector – ω1, ω2,..., ωN: classes in which these vectors are classified: ωi = 1...k.
• To apply Bayes classifier:
– X = s1s2...sm: extened feature vector – Ωi = ωi1, ωi2,..., ωiN : a classification Nm possible classifications
p ( Ωi | X ) > p ( Ω j | X ) ∀j ≠ i
p ( X | Ωi ) p ( Ωi ) > p ( X | Ω j ) p ( Ω j ) ∀j ≠ i
28/03/2011 Markov models 10
Slide 11: Motivation [2]
• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables
28/03/2011
Markov models
11
Slide 12: Motivation [2]
• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables
Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào...
Hôm nay
mùng
...
vào
...
q1
q2
q3
qm
28/03/2011
Markov models
12
Slide 13: Motivation [2]
• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables
Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào...
Hôm nay
mùng
...
vào
...
q1
q2
q3
qm
• What is p(Hôm nay.... vào) = p(q1=Hôm q2=nay ... qm=vào)?
28/03/2011
Markov models
13
Slide 14: Motivation [2]
• From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables
Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào...
Hôm nay
mùng
...
vào
...
q1
q2
q3
qm
• What is p(Hôm nay.... vào) = p(q1=Hôm q2=nay ... qm=vào)? p(s1s2... sm-1 sm) p(sm|s1s2...sm-1) = p(s1s2... sm-1)
28/03/2011 Markov models 14
Slide 15: Contents
• Introduction • Markov Chain • Hidden Markov Models • Markov Random Field
28/03/2011
Markov models
15
Slide 16: Markov Chain
• Has N states, called s1, s2, ..., sN • There are discrete timesteps, t=0, t=1,... • On the t’th timestep the system is in exactly one of the available states. Call it qt ∈ {s1 , s2 ,..., sN }
s2 s1 s3
Current state
N=3 t=0 q t = q 0 = s3
28/03/2011 Markov models 16
Slide 17: Markov Chain
• Has N states, called s1, s2, ..., sN • There are discrete timesteps, t=0, t=1,... • On the t’th timestep the system is in exactly one of the available states. Call it qt ∈ {s1 , s2 ,..., sN } • Between each timestep, the next state is chosen randomly. N=3 t=1 q t = q 1 = s2
28/03/2011 Markov models 17
s2 s1
Current state
s3
Slide 18: Markov Chain
• Has N states, called s1, s2, ..., sN • There are discrete timesteps, t=0, t=1,...
p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0
s2 s1
• On the t’th timestep the system is in exactly one of the available states. p ( qt +1 = s1 ˚ qt = s1 ) = 0 Call it qt ∈ {s1 , s2 ,..., sN }
p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1
s3
p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0
• Between each timestep, the next state is chosen randomly. • The current state determines the probability for the next state.
28/03/2011 Markov models
N=3 t=1 q t = q 1 = s2
18
Slide 19: Markov Chain
• Has N states, called s1, s2, ..., sN • There are discrete timesteps, t=0, t=1,...
1/2
p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0
1/2
s2
2/3
• On the t’th timestep the system is in 1/3 1 exactly one of the available states. p ( qt +1 = s1 ˚ qt = s1 ) = 0 Call it qt ∈ {s1 , s2 ,..., sN }
p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1
s1
s3
• Between each timestep, the next state is chosen randomly. • The current state determines the probability for the next state.
– Often notated with arcs between states
28/03/2011 Markov models
p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0
N=3 t=1 q t = q 1 = s2
19
Slide 20: Markov Property
• qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 )
1/2
p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0
1/2
s2
2/3
s1
1 1/3
= p ( qt +1 ˚ qt )
p ( qt +1 = s1 ˚ qt = s1 ) = 0 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1
s3
p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0
N=3 t=1 q t = q 1 = s2
28/03/2011 Markov models 20
Slide 21: Markov Property
• qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 )
1/2
p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0
1/2
s2
2/3
s1
1 1/3
= p ( qt +1 ˚ qt )
The state at timestep t+1 depends only on the state at timestep t
p ( qt +1 = s1 ˚ qt = s1 ) = 0 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1
s3
p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0
N=3 t=1 q t = q 1 = s2
28/03/2011 Markov models 21
Slide 22: Markov Property
• qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 )
1/2
p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0
1/2
s2
2/3
s1
1 1/3
= p ( qt +1 ˚ qt )
The state at timestep t+1 depends only on the state at timestep t
p ( qt +1 = s1 ˚ qt = s1 ) = 0 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1
s3
p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0
A Markov chain of order m (m finite): the state at timestep t+1 depends on the past m states:
p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt , qt −1 ,..., qt − m +1 )
28/03/2011 Markov models
N=3 t=1 q t = q 1 = s2
22
Slide 23: Markov Property
• qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 )
1/2
p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0
1/2
s2
2/3
s1
1 1/3
= p ( qt +1 ˚ qt )
The state at timestep t+1 depends only on the state at timestep t • How to represent the joint distribution of (q0, q1, q2...) using graphical models?
28/03/2011 Markov models
p ( qt +1 = s1 ˚ qt = s1 ) = 0 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1
s3
p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0
N=3 t=1 q t = q 1 = s2
23
Slide 24: Markov Property
• qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 )
1/2
p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2
q0p ( s
1/2
3
˚ s2 ) = 0
s2
1/3
s1
1
q1
1/3
= p ( qt +1 ˚ qt )
The state at timestep t+1 depends only on the state at timestep t • How to represent the joint distribution of (q0, q1, q2...) using graphical models?
28/03/2011 Markov models
p ( qt +1 = s1 ˚ qt = s1 ) = 0 q2 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1
s3
p ( s1 ˚ s3 ) = 1 3
2
q3 p ( s
˚ s3 ) = 2 3
p ( s3 ˚ s3 ) = 0
N=3 t=1 q t = q 1 = s2
24
Slide 25: Markov chain
• So, the chain of {qt} is called Markov chain q0 q1 q2 q3
28/03/2011
Markov models
25
Slide 26: Markov chain
• So, the chain of {qt} is called Markov chain q0 q1 q2 q3 • Each qt takes value from the countable state-space {s1, s2, s3...} • Each qt is observed at a discrete timestep t • {qt} sastifies the Markov property: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt )
28/03/2011
Markov models
26
Slide 27: Markov chain
• So, the chain of {qt} is called Markov chain q0 q1 q2 q3 • Each qt takes value from the countable state-space {s1, s2, s3...} • Each qt is observed at a discrete timestep t • {qt} sastifies the Markov property: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt ) • The transition from qt to qt+1 is calculated from the transition probability matrix s1 s2 s3 1/2
1/2
s2
s1
1
28/03/2011
2/3 1/3
s1 s2 s3
Markov models
0 ½ 1/3
0 ½ 2/3
1 0 0
27
s3
Transition probabilities
Slide 28: Markov chain
• So, the chain of {qt} is called Markov chain q0 q1 q2 q3 • Each qt takes value from the countable state-space {s1, s2, s3...} • Each qt is observed at a discrete timestep t • {qt} sastifies the Markov property: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt ) • The transition from qt to qt+1 is calculated from the transition probability matrix s1 s2 s3 1/2
1/2
s2
s1
1
28/03/2011
2/3 1/3
s1 s2 s3
Markov models
0 ½ 1/3
0 ½ 2/3
1 0 0
28
s3
Transition probabilities
Slide 29: Markov Chain – Important property
• In a Markov chain, the joint distribution is
p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 )
j =1 m
28/03/2011
Markov models
29
Slide 30: Markov Chain – Important property
• In a Markov chain, the joint distribution is
p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 )
j =1 m
• Why?
p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 , previous states )
j =1 m
m
= p ( q0 ) ∏ p ( q j | q j −1 )
j =1
Due to the Markov property
28/03/2011 Markov models 30
Slide 31: Markov Chain: e.g.
• The state-space of weather:
rain wind
cloud
28/03/2011
Markov models
31
Slide 32: Markov Chain: e.g.
• The state-space of weather:
1/2
rain
1/2 1/3 2/3
wind
Rain 1 Cloud Wind
Rain ½ 1/3 0
Cloud 0 0 1
Wind ½ 2/3 0
cloud
28/03/2011
Markov models
32
Slide 33: Markov Chain: e.g.
• The state-space of weather:
1/2
rain
1/2 1/3 2/3
wind
Rain 1 Cloud Wind
Rain ½ 1/3 0
Cloud 0 0 1
Wind ½ 2/3 0
cloud
• Markov assumption: weather in the t+1’th day is depends only on the t’th day.
28/03/2011
Markov models
33
Slide 34: Markov Chain: e.g.
• The state-space of weather:
1/2
rain
1/2 1/3 2/3
wind
Rain 1 Cloud Wind
Rain ½ 1/3 0
Cloud 0 0 1
Wind ½ 2/3 0
cloud
• Markov assumption: weather in the t+1’th day is depends only on the t’th day. • We have observed the weather in a week:
rain wind
cloud
rain
wind
Day:
28/03/2011
0
1
2
Markov models
3
4
34
Slide 35: Markov Chain: e.g.
• The state-space of weather:
1/2
rain
1/2 1/3 2/3
wind
Rain 1 Cloud Wind
Rain ½ 1/3 0
Cloud 0 0 1
Wind ½ 2/3 0
cloud
• Markov assumption: weather in the t+1’th day is depends only on the t’th day. • We have observed the weather in a week:
rain wind
cloud
Markov Chain
wind
rain
Day:
28/03/2011
0
1
2
Markov models
3
4
35
Slide 36: Contents
• Introduction • Markov Chain • Hidden Markov Models
– Independent assumptions – Formal definition – Forward algorithm – Viterbi algorithm – Baum-Welch algorithm
• Markov Random Field
28/03/2011 Markov models 36
Slide 37: Modeling pairs of sequences
• In many applications, we have to model pair of sequences • Examples:
– POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb...) – Speech recognition (map acoustic sequences to sequences of words) – Computational biology (recover gene boundaries in DNA sequences) – Video tracking (estimate the underlying model states from the observation sequences) – And many others...
28/03/2011
Markov models
37
Slide 38: Probabilistic models for sequence pairs
• We have two sequences of random variables: X1, X2, ..., Xm and S1, S2, ..., Sm • Intuitively, in a pratical system, each Xi corresponds to an observation and each Si corresponds to a state that generated the observation. • Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o} • How do we model the joint distribution:
p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., S m = sm )
28/03/2011
Markov models
38
Slide 39: Hidden Markov Models (HMMs)
• In HMMs, we assume that
p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., Sm = sm ) = p ( S1 = s1 ) ∏ p ( S j = s j ˚ S j −1 = s j −1 ) ∏ p ( X j = x j ˚ S j = s j )
j =2 j =1 m m
• This is often called Independence assumptions in HMMs • We are gonna prove it in the next slides
28/03/2011 Markov models 39
Slide 40: Independence Assumptions in HMMs [1]
p ( ABC ) = p ( A | BC ) p ( BC ) = p ( A | BC ) p ( B ˚ C ) p ( C )
• By the chain rule, the following equality is exact:
p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., S m = sm ) = p ( S1 = s1 ,..., S m = sm ) × p ( X 1 = x1 ,..., X m = xm ˚ S1 = s1 ,..., S m = sm )
• Assumption 1: the state sequence forms a Markov chain
p ( S1 = s1 ,..., S m = sm ) = p ( S1 = s1 ) ∏ p ( S j = s j ˚ S j −1 = s j −1 )
j =2 m
28/03/2011
Markov models
40
Slide 41: Independence Assumptions in HMMs [2]
• By the chain rule, the following equality is exact:
p ( X 1 = x1 ,..., X m = xm ˚ S1 = s1 ,..., S m = sm ) = ∏ p ( X j = x j ˚ S1 = s1 ,..., Sm = sm , X 1 = x1 ,..., X j −1 = x j −1 )
j =1 m
• Assumption 2: each observation depends only on the underlying state
p ( X j = x j ˚ S1 = s1 ,..., Sm = sm , X 1 = x1 ,..., X j −1 = x j −1 ) = p( X j = xj ˚ S j = sj )
• These two assumptions are often called independence assumptions in HMMs
28/03/2011 Markov models 41
Slide 42: The Model form for HMMs
• The model takes the following form:
p ( x1 ,.., xm , s1 ,..., sm ;θ ) = π ( s1 ) ∏ t ( s j ˚ s j −1 ) ∏ e ( x j ˚ s j )
j =2 j =1 m m
• Parameters in the model:
– Initial probabilities π ( s ) for s ∈ {1, 2,..., k } – Transition probabilities t ( s ˚ s′ ) for s, s ' ∈ {1, 2,..., k } – Emission probabilities e ( x ˚ s ) for s ∈ {1, 2,..., k }
and x ∈ {1, 2,.., o}
28/03/2011 Markov models 42
Slide 43: 6 components of HMMs
• Discrete timesteps: 1, 2, ... • Finite state space: {si} (N states) • Events {xi} (M events)
t11 π1 start π2 t12 t21 e13 e31 e22 t23 t32 e23 π3 t31
• Vector of initial probabilities {πi} π Π = {πi } = { p(q1 = si) } • Matrix of transition probabilities T = {Tij} = { p(qt+1=sj|qt=si) } • Matrix of emission probabilities E = {Eij} = { p(ot=xj|qt=si) }
s1
e11
s2
s3
e33
x1
x2
x3
The observations at continuous timesteps form an observation sequence {o1, o2, ..., ot}, where oi ∈ {x1, x2, ..., xM}
28/03/2011 Markov models 43
Slide 44: 6 components of HMMs
• Discrete timesteps: 1, 2, ... • Finite state space: {si} (N states) • Events {xi} (M events)
t11 π1 start π2 t12 t21 e13 e31 e22 t23 t32 e23 π3 t31
• Vector of initial probabilities {πi} π Π = {πi } = { p(q1 = si) } • Matrix of transition probabilities T = {Tij} = { p(qt+1=sj|qt=si) } • Matrix of emission probabilities E = {Eij} = { p(ot=xj|qt=si) }
s1
e11
s2
s3
e33
x1
x2
x3
Constraints:
N N M The observations at continuous timesteps form an observation sequence πi = 1 {o1, o2, ..., ot}, where oi ∈ {x1Tij 2=..., xM} Eij = 1 ,x , 1
∑
i =1
∑
j =1
∑
j =1
28/03/2011
Markov models
44
Slide 45: 6 components of HMMs
• Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic • Example: Given the observation sequence: {x1, x3, x3, x2} The corresponding states can be any of following sequences: {s1, s2, s1, s2} {s1, s2, s3, s2} {s1, s1, s1, s2} ...
28/03/2011 Markov models 45
start π1 t11 π2 t12 t21 e13 e31 e22 t23 t32 e23 π3 t31
s1
e11
s2
s3
e33
x1
x2
x3
Slide 46: Here’s an HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
T s1 s2 s3 s1 0.5 0.4 0.2 s2 0.5 0 0.8 s3 0 0.6 0 E s1 s2 s3
x2
x1 0.3 0 0.2 x2 0 0.1 0
x3
x3 0.7 0.9 0.8
π
s1 0.3
s2 0.3
s3 0.4
28/03/2011
Markov models
46
Slide 47: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
π
T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3
x2
x3
• Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations:
0.3 - 0.3 - 0.4 randomply choice between S1, S2, S3
0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8
Markov models
q1 q2 q3
o1 o2 o3
47
28/03/2011
Slide 48: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
π
T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3
x2
x3
• Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations:
0.2 - 0.8 choice between X1 and X3
0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8
Markov models
q1 q2 q3
S3
o1 o2 o3
48
28/03/2011
Slide 49: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
π
T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3
x2
x3
• Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations:
Go to S2 with probability 0.8 or S1 with prob. 0.2
0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8
Markov models
q1 q2 q3
S3
o1 o2 o3
X3
28/03/2011
49
Slide 50: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
π
T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3
x2
x3
• Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations:
0.3 - 0.7 choice between X1 and X3
0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8
Markov models
q1 q2 q3
S3 S1
o1 o2 o3
X3
28/03/2011
50
Slide 51: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
π
T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3
x2
x3
• Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations:
Go to S2 with probability 0.5 or S1 with prob. 0.5
0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8
Markov models
q1 q2 q3
S3 S1
o1 o2 o3
X3 X1
28/03/2011
51
Slide 52: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
π
T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3
x2
x3
• Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations:
0.3 - 0.7 choice between X1 and X3
0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8
Markov models
q1 q2 q3
S3 S1 S1
o1 o2 o3
X3 X1
28/03/2011
52
Slide 53: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
π
T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3
x2
x3
• Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations:
We got a sequence of states and corresponding observations!
0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8
Markov models
q1 q2 q3
S3 S1 S1
o1 o2 o3
X3 X1 X3
53
28/03/2011
Slide 54: Three famous HMM tasks
• Given a HMM Φ = (T, E, π). Three famous HMM tasks are: • Probability of an observation sequence (state estimation)
– Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O)
• Most likely expaination (inference)
– Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O)
• Learning the HMM
– Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π)
28/03/2011
Markov models
54
Slide 55: Three famous HMM tasks
• Given a HMM Φ = (T, E, π). Three famous HMM tasks are: • Probability of an observation sequence (state estimation)
– Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O)
Calculating the probability of observing the sequence O over all of possible sequences.
• Most likely expaination (inference)
– Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O)
• Learning the HMM
– Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π)
28/03/2011
Markov models
55
Slide 56: Three famous HMM tasks
• Given a HMM Φ = (T, E, π). Three famous HMM tasks are: • Probability of an observation sequence (state estimation)
– Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O)
Calculating the best corresponding state sequence, given an observation sequence.
• Most likely expaination (inference)
– Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O)
• Learning the HMM
– Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π)
28/03/2011
Markov models
56
Slide 57: Three famous HMM tasks
• Given a HMM Φ = (T, E, π). Three famous HMM tasks are: • Probability of an observation sequence (state estimation)
– Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O)
Given an (or a set of) observation sequence and corresponding state sequence, estimate the Transition matrix, Emission matrix and initial probabilities of the HMM
• Most likely expaination (inference)
– Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O)
• Learning the HMM
– Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π)
28/03/2011
Markov models
57
Slide 58: Three famous HMM tasks
Problem State estimation Calculating: p(O|Φ) Inference Calculating: Q*= argmaxQp(Q|O) Learning Calculating: Φ* = argmaxΦp(O|Φ) Baum-Welch (EM) O(TN2) Viterbi decoding O(TN2) Algorithm Forward Complexity O(TN2)
T: number of timesteps N: number of states
28/03/2011 Markov models 58
Slide 59: State estimation problem
• Given: Φ = (T, E, π), observation O = {o1, o2,..., ot} • Goal: What is p(o1o2...ot) ? • We can do this in a slow, stupid way
– As shown in the next slide...
28/03/2011
Markov models
59
Slide 60: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
• What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? • Slow, stupid way:
p (O ) = =
x1
x2
x3
∑
Q∈paths of length 3
p ( OQ ) p (O | Q ) p (Q )
∑
Q∈ Q∈paths of length 3
• How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q?
28/03/2011
Markov models
60
Slide 61: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
• What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? • Slow, stupid way:
p (O ) = =
x1
π
s1 0.3 s2 0.3
x2
s3 0.4
x3
∑
Q∈paths of length 3
p ( OQ ) p (O | Q ) p (Q )
∑
Q∈ Q∈paths of length 3
p(Q) = p(q1q2q3) = p(q1)p(q2|q1)p(q3|q2,q1) (chain) = p(q1)p(q2|q1)p(q3|q2) (why?) Example in the case Q=S3S1S1 P(Q) = 0.4 * 0.2 * 0.5 = 0.04
28/03/2011
• How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q?
Markov models
61
Slide 62: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
• What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? • Slow, stupid way:
p (O ) = =
x1
π
s1 0.3 s2 0.3
x2
s3 0.4
x3
∑
Q∈paths of length 3
p ( OQ ) p (O | Q ) p (Q )
∑
Q∈ Q∈paths of length 3
p(O|Q) = p(o1o2o3|q1q2q3) = p(o1|q1)p(o2|q1)p(o3|q3) (why?) Example in the case Q=S3S1S1 P(O|Q) = p(X3|S3)p(X1|S1) p(X3|S1) =0.8 * 0.3 * 0.7 = 0.168
28/03/2011
• How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q?
Markov models
62
Slide 63: Here’s a HMM
0.5 0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
• What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? • Slow, stupid way:
p (O ) = =
x1
π
s1 0.3 s2 0.3
x2
s3 0.4
x3
∑
Q∈paths of length 3
p ( OQ ) p (O | Q ) p (Q )
∑
Q∈ Q∈paths of length 3
p(O|Q) = p(o1o2o3|q1q2q3) p(O) needs 27 p(Q) = p(o1|q1)p(o2|q1)p(o3|q3) (why?)
computations and 27 p(O|Q) computations.
• How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q?
So let’s be smarter...
63
Example in the case Q=S3S1S1 What if the P(O|Q) = p(X3|S3)p(Xsequence3has ) 1|S1) p(X |S1 20 observations? =0.8 * 0.3 * 0.7 = 0.168
28/03/2011
Markov models
Slide 64: The Forward algorithm
• Given observation o1o2...oT • Forward probabilities: αt(i) = p(o1o2...ot ∧ qt = si | Φ) where 1 ≤ t ≤ T
αt(i) = probability that, in a random trial:
– We’d have seen the first t observations – We’d have ended up in si as the t’th state visited.
• In our example, what is α2(3) ?
28/03/2011 Markov models 64
Slide 65: αt(i): easy to define recursively
α t ( i ) = p ( o1o2 ...ot ∧ qt = si | Φ )
α1 ( i ) = p ( o1 ∧ q1 = si )
= p ( q1 = si ) p ( o1 | q1 = si ) = π i Ei ( o1 )
Π = {π i } = { p ( q1 = si )}
{ E = {E } = { p ( o = x
ij t
T = {Tij } = p ( qt +1 = s j | qt = si )
j t i
} | q = s )}
α t +1 ( i ) = p ( o1o2 ...ot +1 ∧ qt +1 = si )
= ∑ p ( o1o2 ...ot ∧ qt = s j ∧ ot +1 ∧ qt +1 = si )
j =1 N
= ∑ p ( ot +1 ∧ qt +1 = si | o1o2 ...ot ∧ qt = s j ) p ( o1o2 ...ot ∧ qt = s j )
j =1
N
= ∑ p ( ot +1 ∧ qt +1 = si | qt = s j ) α t ( j )
j =1
N
= ∑ p ( ot +1 | qt +1 = si ) p ( qt +1 = si | qt = s j ) α t ( j )
j =1
N
= ∑T ji Ei ( ot +1 ) α t ( j )
j =1
N
28/03/2011
Markov models
65
Slide 66: In our example
αt ( i ) = p ( o1o2 ...ot ∧ qt = si | Φ ) α1 ( i ) = Ei ( o1 ) π i αt +1 ( i ) = ∑Tji Ei ( ot +1 ) αt ( j ) = Ei ( ot +1 ) ∑Tjiαt ( j )
j j
0.5
0.2
s1
0.3 0.2
0.5 0.4 0.7
s2
0.1
0.6 0.8 0.9
s3
0.8
x1
x2
π
s1 0.3 s2 0.3
x3
s3 0.4
We observed:
α1(1) = 0.3 * 0.3 = 0.09 α1(2) = 0 α1(3) = 0.2 * 0.4 = 0.08
x1x2
α2(1) = 0 * (0.09*0 .5+ 0*0.4 + 0.08*0.2) = 0 α2(2) = 0.1 * (0.09*0.5 + 0*0 + 0.08*0.8) = 0.0109 α2(3) = 0 * (0.09*0 + 0*0.6 + 0.08*0) = 0
28/03/2011
Markov models
66
Slide 67: Forward probabilities - Trellis
N
s4
s3
s2
s1
1
28/03/2011
2
3
4
Markov models
5
6
T
67
Slide 68: Forward probabilities - Trellis
N
s4
α1 (4)
s3
α1 (3)
α2 (3)
α6 (3)
s2
α1 (2)
α3 (2)
α5 (2)
s1
α1 (1)
α4 (1)
1
28/03/2011
2
3
4
Markov models
5
6
T
68
Slide 69: Forward probabilities - Trellis
N
s4
α1 (4)
α1 ( i ) = Ei ( o1 ) π i
s3
α1 (3)
α2 (3)
s2
α1 (2)
s1
α1 (1)
1
28/03/2011
2
3
4
Markov models
5
6
T
69
Slide 70: Forward probabilities - Trellis
N
s4
α1 (4)
αt +1 ( i ) = Ei ( ot +1 ) ∑Tjiαt ( j )
j
s3
α1 (3)
α2 (3)
s2
α1 (2)
s1
α1 (1)
1
28/03/2011
2
3
4
Markov models
5
6
T
70
Slide 71: Forward probabilities
• So, we can cheaply compute:
αt ( i ) = p ( o1o2 ...ot ∧ qt = si )
p ( o1 o 2 ...o t )
• How can we cheaply compute:
• How can we cheaply compute:
p ( q t = s i | o1 o 2 ...o t )
28/03/2011
Markov models
71
Slide 72: Forward probabilities
• So, we can cheaply compute:
αt ( i ) = p ( o1o2 ...ot ∧ qt = si )
• How can we cheaply compute: p ( o1 o 2 ...o t ) = • How can we cheaply compute:
∑ α (i )
t i
αt ( i ) p ( q t = s i | o1 o 2 ...o t ) = ∑α t ( j )
j
Look back the trellis...
28/03/2011 Markov models 72
Slide 73: State estimation problem
• State estimation is solved:
p ( O | Φ ) = p ( o1o2 … ot ) = ∑ α i ( i )
i =1 N
• Can we utilize the elegant trellis to solve the Inference problem?
– Given an observation sequence O, find the best state sequence Q
Q = arg max p ( Q | O )
* Q
28/03/2011
Markov models
73
Slide 74: Inference problem
• Given: Φ = (T, E, π), observation O = {o1, o2,..., ot} • Goal: Find
Q * = arg max p ( Q | O )
Q
= arg max p ( q1q2 … qt | o1o2 … ot )
q1q2 … qt
• Practical problems:
– Speech recognition: Given an utterance (sound), what is the best sentence (text) that matches the utterance? – Video tracking – POS Tagging
28/03/2011 Markov models
s1 x1
s2 x2
s3 x3
74
Slide 75: Inference problem
• We can do this in a slow, stupid way: Q * = arg max p ( Q | O )
Q
= arg max
Q
p (O | Q ) p (Q ) p (O )
= arg max p ( O | Q ) p ( Q )
Q
= arg max p ( o1o2 … ot | Q ) p ( Q )
Q
• But it’s better if we can find another way to compute the most probability path (MPP)...
28/03/2011 Markov models 75
Slide 76: Efficient MPP computation
• We are going to compute the following variables:
δ t ( i ) = max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 …ot )
q1q2 …qt −1
• δt(i) is the probability of the best path of length t-1 which ends up in si and emits o1...ot. • Define: mppt(i) = that path so:
28/03/2011
δt(i) = p(mppt(i))
Markov models 76
Slide 77: Viterbi algorithm
δ t ( i ) = max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 … ot )
q1q2 …qt −1
mppt ( i ) = arg max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 …ot )
q1q2 …qt −1
δ1 ( i ) = max p ( q1 = si ∧ o1 )
one choice
= π i Ei ( o1 ) = α1 ( i )
s4 s3 s2 s1
N δ (4) 1 δ 1 (3) δ 1 (2) δ 2 (3)
δ 1 (1)
1 2 3 4
Markov models
5
6
T
77
28/03/2011
Slide 78: Viterbi algorithm
time t time t + 1
• The most prob path with last two states sisj is the most path to si, followed by transition si sj.
s1 si
... ...
28/03/2011
...
sj
• The prob of that path will be: δt(i) × p(si sj ∧ ot+1)
= δt(i)TijEj(ot+1) • So, the previous state at time t is:
i* = arg max δ t ( i ) Tij E j ( ot +1 )
i
Markov models 78
Slide 79: Viterbi algorithm
• Summary:
δ t +1 ( j ) = δ t ( i* ) Tij E j ( ot +1 )
mppt +1 ( j ) = mppt i* s j i* = arg max δ t ( i ) Tij E j ( ot +1 )
i
δ1 ( i ) = π i Ei ( o1 ) = α1 ( i )
()
s4 s3 s2 s1
N δ (4) 1 δ 1 (3) δ 1 (2) δ 2 (3)
δ 1 (1)
1 2 3 4
Markov models
5
6
T
79
28/03/2011
Slide 80: What’s Viterbi used for?
• Speech Recognition
Chong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, “Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors”, EECS Department, University of California, Berkeley, 2008.
28/03/2011
Markov models
80
Slide 81: Training HMMs
• Given: large sequence of observation o1o2...oT and number of states N. • Goal: Estimation of parameters Φ = 〈T, E, π〉 • That is, how to design an HMM. • We will infer the model from a large amount of data o1o2...oT with a big “T”.
28/03/2011 Markov models 81
Slide 82: Training HMMs
• Remember, we have just computed p(o1o2...oT | Φ) • Now, we have some observations and we want to inference Φ from them. • So, we could use:
– MAX LIKELIHOOD: – BAYES: Compute p ( Φ | o1 … oT ) then take E [ Φ ] or max p ( Φ | o1 … oT )
Φ
Φ = arg max p ( o1 … oT | Φ )
Φ
28/03/2011
Markov models
82
Slide 83: Max likelihood for HMMs
• Forward probability: the probability of producing o1...ot while ending up in state si
αt ( i ) = p ( o1o2 ...ot ∧ qt = si )
α1 ( i ) = Ei ( o1 ) π i α t +1 ( i ) = Ei ( ot +1 ) ∑ T jiα t ( j )
j
• Backward probability: the probability of producing ot+1...oT given that at time t, we are at state si
βt ( i ) = p ( ot +1ot +2 ...oT | qt = si )
28/03/2011
Markov models
83
Slide 84: Max likelihood for HMMs - Backward
• Backward probability: easy to define recursively
βt ( i ) = p ( ot +1ot +2 ...oT | qt = si ) βT ( i ) = 1 βt ( i ) = ∑ p ( ot +1 ∧ ot +2 ...oT ∧ qt +1 = s j | qt = si )
j =1 N
βT ( i ) = 1 βt ( i ) = ∑ βt +1 ( j ) Tij E j ( ot +1 )
j =1 N
= ∑ p ( ot +1 ∧ qt +1 = s j | qt = si ) p ( ot + 2 ...oT | ot +1 ∧ qt +1 = s j ∧ qt = si )
j =1
N
= ∑ p ( ot +1 ∧ qt +1 = s j | qt = si ) p ( ot + 2 ...oT | qt +1 = s j )
j =1
N
= ∑ βt +1 ( j ) Tij E j ( ot +1 )
j =1
28/03/2011 Markov models 84
N
Slide 85: Max likelihood for HMMs
• The probability of traversing a certain arc at time t given o1o2...oT:
ε ij ( t ) = p ( qt = si ∧ qt +1 = s j | o1o2 …oT )
= = p ( qt = si ∧ qt +1 = s j ∧ o1o2 …oT ) p ( o1o2 …oT ) p ( o1o2 … ot ∧ qt = si ) p ( qt = si ∧ qt +1 = s j ) p ( ot +1ot + 2 …oT | qt = si )
∑ p ( o o …o ∧ q
12 t i =1
N
t
= si ) p ( ot +1ot + 2 … oT | qt = si )
ε ij ( t ) =
α t ( i ) Tij β t ( i )
∑α (i ) β (i )
t t i =1
Markov models 85
N
28/03/2011
Slide 86: Max likelihood for HMMs
• The probability of being at state si at time t given o1o2...oT:
γ i ( t ) = p ( qt = si | o1o2 …oT )
= ∑ p ( qt = si ∧ qt +1 = s j | o1o2 …oT )
j =1 N
γ i ( t ) = ∑ ε ij ( t )
j =1
N
28/03/2011
Markov models
86
Slide 87: Max likelihood for HMMs
• Sum over the time index:
– Expected # of transitions from state i to j in o1o2...oT:
∑ ε (t )
ij t =1
T −1
– Expected # of transitions from state i in o1o2...oT :
∑ γ ( t ) = ∑∑ ε ( t ) = ∑ ∑ε ( t )
i ij ij t =1 t =1 j =1 j =1 t =1
28/03/2011 Markov models 87
T −1
T −1 N
N T −1
Slide 88: Update parameters
Π = {π i } = { p ( q1 = si )}
{ E = {E } = { p ( o = x
ij t
T = {Tij } = p ( qt +1 = s j | qt = si )
j t i
} | q = s )}
ˆ π i = expected frequency in state i at time t = 1 = γ i (1) expected # of transitions from state i to j Tij = = expected # of transitions from state i
∑ ε (t )
ij t =1 T −1 t =1
T −1
∑ ε (t )
ij
T −1
∑ γ ( t ) ∑∑ ε ( t )
i ij j =1 t =1
=
t =1 N T −1
expected # of transitions from state i with x k observed Eik = expected # of transitions from state i
∑ δ ( o , x ) γ ( t ) ∑∑ δ ( o , x ) ε ( t )
t k i t k ij
T −1 t =1
N T −1
=
∑ γ (t )
i t =1
T −1
=
j =1 t =1 N T −1 j =1 t =1
∑∑ ε ( t )
ij
Markov models 88
28/03/2011
Slide 89: The inner loop of Forward-Backward
Given an input sequence. 1. Calculate forward probability:
– Base case:
α1 ( i ) = Ei ( o1 ) π i
j
– Recursive case: α t +1 ( i ) = Ei ( ot +1 ) ∑ T jiα t ( j )
2. Calculate backward probability:
– – Base case:
βT ( i ) = 1
N j =1
Recursive case: βt ( i ) = ∑ βt +1 ( j ) Tij E j ( ot +1 )
3. Calculate expected count: 4. Update parameters: T −1 ∑ ε ij ( t )
Tij =
28/03/2011
N T −1
ε ij ( t ) =
α t ( i ) Tij βt ( i )
∑α (i ) β (i )
t t i =1
N
∑∑ δ ( o , x ) ε ( t )
t k ij j =1 t =1 N T −1 j =1 t =1
∑∑ ε ( t )
ij j =1 t =1
t =1 N T −1
Eik =
∑∑ ε ( t )
ij
Markov models
89
Slide 90: Forward-Backward: EM for HMM
• If we knew Φ we could estimate expectations of quantities such as
– Expected number of times in state i – Expected number of transitions i j
• If we knew the quantities such as
– Expected number of times in state i – Expected number of transitions i j
we could compute the max likelihood estimate of Φ = 〈T, E, Π〉 • Also known (for the HMM case) as the Baum-Welch algorithm.
28/03/2011 Markov models 90
Slide 91: EM for HMM
• Each iteration provides values for all the parameters • The new model always improve the likeliness of the training data:
ˆ p o1o2 …oT | Φ ≥ p ( o1o2 …oT | Φ )
• The algorithm does not guarantee to reach global maximum.
(
)
28/03/2011
Markov models
91
Slide 92: EM for HMM
• Bad News
– There are lots of local minima
• Good News
– The local minima are usually adequate models of the data.
• Notice
– EM does not estimate the number of states. That must be given (tradeoffs) – Often, HMMs are forced to have some links with zero probability. This is done by setting Tij = 0 in initial estimate Φ(0) – Easy extension of everything seen today: HMMs with real valued outputs
28/03/2011
Markov models
92
Slide 93: Contents
• Introduction • Markov Chain • Hidden Markov Models • Markov Random Field (from the viewpoint of classification)
28/03/2011
Markov models
93
Slide 94: Example: Image segmentation
• Observations: pixel values • Hidden variable: class of each pixel • It’s reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models? • Errr.... the relationships are in 2D!
28/03/2011 Markov models 94
Slide 95: MRF as a 2D generalization of MC
• Array of observations: • Classes/States:
S = {sij } , X = { xij } , sij = 1...M 0 ≤ i < Nx , 0 ≤ j < N y
• Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p( X | S ) p(S )
is maximum.
28/03/2011
Markov models
95
Slide 96: 2D context-dependent classification
• Assumptions:
– The values of elements in S are mutually dependent. – The range of this dependence is limited within a neighborhood.
• For each (i, j) element of S, a neighborhood Nij is defined so that
– sij ∉ Nij: (i, j) element does not belong to its own set of neighbors. – sij ∈ Nkl ⇔ skl ∈ Nij: if sij is a neighbor of skl then skl is also a neighbor of sij
28/03/2011
Markov models
96
Slide 97: 2D context-dependent classification
• The Markov property for 2D case:
p sij | Sij = p ( sij | N ij )
where Sij includes all the elements of S except the (i, j) one. • The elegeant dynamic programing is not applicable: the problem is much harder now!
(
)
28/03/2011
Markov models
97
Slide 98: 2D context-dependent classification
• The Markov property for 2D case:
p sij | Sij = p ( sij | N ij )
where Sij includes all the elements of S except the (i, j) one. •
(
)
We are gonna see an The elegeant dynamic programing is not applicable: the problem is application of MRF for much harder now! Image Segmentation and Restoration.
28/03/2011
Markov models
98
Slide 99: MRF for Image Segmentation
• Cliques: a set of each pixel which are neighbors of each other (w.r.t the type of neighborhood)
28/03/2011
Markov models
99
Slide 100: MRF for Image Segmentation
• Dual Lattice number • Line process:
28/03/2011
Markov models
100
Slide 101: MRF for Image Segmentation
• Gibbs distribution:
−U ( s ) 1 π ( s ) = exp Z T
– Z: normalizing constant – T: parameter
• It turns out that Gibbs distribution implies MRF ([Gema 84])
28/03/2011 Markov models 101
Slide 102: MRF for Image Segmentation
• A Gibbs conditional probability is of the form:
1 1 p ( sij | N ij ) = exp − ∑ Fk ( Ck ( i, j ) ) Z Tk
– Ck(i, j): clique of the pixel (i, j) – Fk: some functions, e.g.
1 − sij α1 + α 2 ( si −1, j + si +1, j ) + α 2 ( si , j −1 + si , j +1 ) T
(
)
28/03/2011
Markov models
102
Slide 103: MRF for Image Segmentation
• Then, the joint probability for the Gibbs model is
∑∑ Fk ( Ck ( i, j ) ) i, j k p ( S ) = exp − T
– The sum is calculated over all possible cliques associated with the neighborhood.
• We also need to work out p(X|S) • Then p(X|S)p(S) can be maximized... [Gema 84]
28/03/2011 Markov models 103
Slide 104: More on Markov models...
• MRF does not stop there... Here are some related models:
– Conditional random field (CRF) – Graphical models – ...
• Markov Chain and HMM does not stop there...
– Markov chain of order m – Continuous-time Markov chains... – Real-value observations – ...
28/03/2011
Markov models
104
Slide 105: What you should know
• Markov property, Markov Chain • HMM:
– Defining and computing αt(i) – Viterbi algorithm – Outline of the EM algorithm for HMM
• Markov Random Field
– And an application in Image Segmentation – [Geman 84] for more information.
28/03/2011
Markov models
105
Slide 106: Q&A
28/03/2011
Markov models
106
Slide 107: References
• L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition“, Proc. of the IEEE, Vol.77, No.2, pp.257--286, 1989. • • Andrew W. Moore, “Hidden Markov Models”, http://www.autonlab.org/tutorials/ Geman S., Geman D. “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984.
28/03/2011
Markov models
107