bgrosofy's picture
From bgrosofy rss RSS  subscribe Subscribe

Markov Models 

 

 
 
Tags:  find a realator  markov chain  mrf  markov 
Views:  109
Published:  October 06, 2011
 
0
download

Share plick with friends Share
save to favorite
Report Abuse Report Abuse
 
Related Plicks
 Tips for Finding a Course to Learn Spanish Online

Tips for Finding a Course to Learn Spanish Online

From: desea
Views: 258 Comments: 0
Tips for Finding a Course to Learn Spanish Online
 
 Where To Find A Friend To Help You Learn Spanish

Where To Find A Friend To Help You Learn Spanish

From: adese
Views: 223 Comments: 0
Where To Find A Friend To Help You Learn Spanish
 
Ralph a-terrible-realator

Ralph a-terrible-realator

From: catai
Views: 187 Comments: 0

 
Find A Coach

Find A Coach

From: findacoach
Views: 116 Comments: 0
Find A Coach
 
How To Find A Kelp Paddy

How To Find A Kelp Paddy

From: skarlapengkie
Views: 5 Comments: 0
http://www.bdoutdoors.com/How To Find A Kelp Paddy
by visiting this link

 
See all 
 
More from this user
Technology IS The Answer For Today\’s (and Tomorrow\&rsquo ;s) Mortgage Industry

Technology IS The Answer For Today\’s (and Tomorrow\’s) Mortgage Industry

From: bgrosofy
Views: 136
Comments: 0

Hult\’s Global MBA bochure

Hult\’s Global MBA bochure

From: bgrosofy
Views: 133
Comments: 0

Ше&#108 3;ех&#1 086;в, Online FAQ 2011

Шелехов, Online FAQ 2011

From: bgrosofy
Views: 42
Comments: 0

Zend Framework For Enterprise PHP On IBM i

Zend Framework For Enterprise PHP On IBM i

From: bgrosofy
Views: 4123
Comments: 0

Characters at work - The use of personas in product management - brainmates

Characters at work - The use of personas in product management - brainmates

From: bgrosofy
Views: 15
Comments: 0

 
See all 
 
 
 URL:          AddThis Social Bookmark Button
Embed Thin Player: (fits in most blogs)
Embed Full Player :
 
 

Name

Email (will NOT be shown to other users)

 

 
 
Comments: (watch)
 
 
Notes:
 
Slide 1: PATTERN RECOGNITION Markov models Vu PHAM phvu@fit.hcmus.edu.vn Department of Computer Science March 28th, 2011 28/03/2011 Markov models 1
Slide 2: Contents • Introduction – Introduction – Motivation • Markov Chain • Hidden Markov Models • Markov Random Field 28/03/2011 Markov models 2
Slide 3: Introduction • Markov processes are first proposed by Russian mathematician Andrei Markov – He used these processes to investigate Pushkin’s poem. • Nowadays, Markov property and HMMs are widely used in many domains: – Natural Language Processing – Speech Recognition – Bioinformatics – Image/video processing – ... 28/03/2011 Markov models 3
Slide 4: Motivation [0] • As shown in his paper in 1906, Markov’s original motivation is purely mathematical: – Application of The Weak Law of Large Number to dependent random variables. • However, we shall not follow this motivation... 28/03/2011 Markov models 4
Slide 5: Motivation [1] • From the viewpoint of classification: – Context-free classification: Bayes classifier p (ωi | x ) > p (ω j | x ) ∀j ≠ i 28/03/2011 Markov models 5
Slide 6: Motivation [1] • From the viewpoint of classification: – Context-free classification: Bayes classifier p (ωi | x ) > p (ω j | x ) ∀j ≠ i • Classes are independent. • Feature vectors are independent. 28/03/2011 Markov models 6
Slide 7: Motivation [1] • From the viewpoint of classification: – Context-free classification: Bayes classifier p (ωi | x ) > p (ω j | x ) ∀j ≠ i – However, there are some applications where various classes are closely realated: • POS Tagging, Tracking, Gene boundary recover... s1 28/03/2011 s2 s3 Markov models ... sm ... 7
Slide 8: Motivation [1] • Context-dependent classification: s1 s2 s3 ... sm ... – s1, s2, ..., sm: sequence of m feature vector – ω1, ω2,..., ωN: classes in which these vectors are classified: ωi = 1...k. 28/03/2011 Markov models 8
Slide 9: Motivation [1] • Context-dependent classification: s1 s2 s3 ... sm ... – s1, s2, ..., sm: sequence of m feature vector – ω1, ω2,..., ωN: classes in which these vectors are classified: ωi = 1...k. • To apply Bayes classifier: – X = s1s2...sm: extened feature vector – Ωi = ωi1, ωi2,..., ωiN : a classification Nm possible classifications p ( Ωi | X ) > p ( Ω j | X ) ∀j ≠ i p ( X | Ωi ) p ( Ωi ) > p ( X | Ω j ) p ( Ω j ) ∀j ≠ i 28/03/2011 Markov models 9
Slide 10: Motivation [1] • Context-dependent classification: s1 s2 s3 ... sm ... – s1, s2, ..., sm: sequence of m feature vector – ω1, ω2,..., ωN: classes in which these vectors are classified: ωi = 1...k. • To apply Bayes classifier: – X = s1s2...sm: extened feature vector – Ωi = ωi1, ωi2,..., ωiN : a classification Nm possible classifications p ( Ωi | X ) > p ( Ω j | X ) ∀j ≠ i p ( X | Ωi ) p ( Ωi ) > p ( X | Ω j ) p ( Ω j ) ∀j ≠ i 28/03/2011 Markov models 10
Slide 11: Motivation [2] • From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables 28/03/2011 Markov models 11
Slide 12: Motivation [2] • From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào... Hôm nay mùng ... vào ... q1 q2 q3 qm 28/03/2011 Markov models 12
Slide 13: Motivation [2] • From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào... Hôm nay mùng ... vào ... q1 q2 q3 qm • What is p(Hôm nay.... vào) = p(q1=Hôm q2=nay ... qm=vào)? 28/03/2011 Markov models 13
Slide 14: Motivation [2] • From a general view, sometimes we want to evaluate the joint distribution of a sequence of dependent random variables Hôm nay mùng tám tháng ba Chị em phụ nữ đi ra đi vào... Hôm nay mùng ... vào ... q1 q2 q3 qm • What is p(Hôm nay.... vào) = p(q1=Hôm q2=nay ... qm=vào)? p(s1s2... sm-1 sm) p(sm|s1s2...sm-1) = p(s1s2... sm-1) 28/03/2011 Markov models 14
Slide 15: Contents • Introduction • Markov Chain • Hidden Markov Models • Markov Random Field 28/03/2011 Markov models 15
Slide 16: Markov Chain • Has N states, called s1, s2, ..., sN • There are discrete timesteps, t=0, t=1,... • On the t’th timestep the system is in exactly one of the available states. Call it qt ∈ {s1 , s2 ,..., sN } s2 s1 s3 Current state N=3 t=0 q t = q 0 = s3 28/03/2011 Markov models 16
Slide 17: Markov Chain • Has N states, called s1, s2, ..., sN • There are discrete timesteps, t=0, t=1,... • On the t’th timestep the system is in exactly one of the available states. Call it qt ∈ {s1 , s2 ,..., sN } • Between each timestep, the next state is chosen randomly. N=3 t=1 q t = q 1 = s2 28/03/2011 Markov models 17 s2 s1 Current state s3
Slide 18: Markov Chain • Has N states, called s1, s2, ..., sN • There are discrete timesteps, t=0, t=1,... p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0 s2 s1 • On the t’th timestep the system is in exactly one of the available states. p ( qt +1 = s1 ˚ qt = s1 ) = 0 Call it qt ∈ {s1 , s2 ,..., sN } p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 s3 p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 • Between each timestep, the next state is chosen randomly. • The current state determines the probability for the next state. 28/03/2011 Markov models N=3 t=1 q t = q 1 = s2 18
Slide 19: Markov Chain • Has N states, called s1, s2, ..., sN • There are discrete timesteps, t=0, t=1,... 1/2 p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0 1/2 s2 2/3 • On the t’th timestep the system is in 1/3 1 exactly one of the available states. p ( qt +1 = s1 ˚ qt = s1 ) = 0 Call it qt ∈ {s1 , s2 ,..., sN } p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 s1 s3 • Between each timestep, the next state is chosen randomly. • The current state determines the probability for the next state. – Often notated with arcs between states 28/03/2011 Markov models p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 N=3 t=1 q t = q 1 = s2 19
Slide 20: Markov Property • qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/2 p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0 1/2 s2 2/3 s1 1 1/3 = p ( qt +1 ˚ qt ) p ( qt +1 = s1 ˚ qt = s1 ) = 0 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 s3 p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 N=3 t=1 q t = q 1 = s2 28/03/2011 Markov models 20
Slide 21: Markov Property • qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/2 p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0 1/2 s2 2/3 s1 1 1/3 = p ( qt +1 ˚ qt ) The state at timestep t+1 depends only on the state at timestep t p ( qt +1 = s1 ˚ qt = s1 ) = 0 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 s3 p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 N=3 t=1 q t = q 1 = s2 28/03/2011 Markov models 21
Slide 22: Markov Property • qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/2 p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0 1/2 s2 2/3 s1 1 1/3 = p ( qt +1 ˚ qt ) The state at timestep t+1 depends only on the state at timestep t p ( qt +1 = s1 ˚ qt = s1 ) = 0 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 s3 p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 A Markov chain of order m (m finite): the state at timestep t+1 depends on the past m states: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt , qt −1 ,..., qt − m +1 ) 28/03/2011 Markov models N=3 t=1 q t = q 1 = s2 22
Slide 23: Markov Property • qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/2 p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 p ( s3 ˚ s2 ) = 0 1/2 s2 2/3 s1 1 1/3 = p ( qt +1 ˚ qt ) The state at timestep t+1 depends only on the state at timestep t • How to represent the joint distribution of (q0, q1, q2...) using graphical models? 28/03/2011 Markov models p ( qt +1 = s1 ˚ qt = s1 ) = 0 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 s3 p ( s1 ˚ s3 ) = 1 3 p ( s2 ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 N=3 t=1 q t = q 1 = s2 23
Slide 24: Markov Property • qt+1 is conditionally independent of {qt-1, qt-2,..., q0} given qt. • In other words: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) 1/2 p ( s1 ˚ s2 ) = 1 2 p ( s2 ˚ s2 ) = 1 2 q0p ( s 1/2 3 ˚ s2 ) = 0 s2 1/3 s1 1 q1 1/3 = p ( qt +1 ˚ qt ) The state at timestep t+1 depends only on the state at timestep t • How to represent the joint distribution of (q0, q1, q2...) using graphical models? 28/03/2011 Markov models p ( qt +1 = s1 ˚ qt = s1 ) = 0 q2 p ( s2 ˚ s1 ) = 0 p ( s3 ˚ s1 ) = 1 s3 p ( s1 ˚ s3 ) = 1 3 2 q3 p ( s ˚ s3 ) = 2 3 p ( s3 ˚ s3 ) = 0 N=3 t=1 q t = q 1 = s2 24
Slide 25: Markov chain • So, the chain of {qt} is called Markov chain q0 q1 q2 q3 28/03/2011 Markov models 25
Slide 26: Markov chain • So, the chain of {qt} is called Markov chain q0 q1 q2 q3 • Each qt takes value from the countable state-space {s1, s2, s3...} • Each qt is observed at a discrete timestep t • {qt} sastifies the Markov property: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt ) 28/03/2011 Markov models 26
Slide 27: Markov chain • So, the chain of {qt} is called Markov chain q0 q1 q2 q3 • Each qt takes value from the countable state-space {s1, s2, s3...} • Each qt is observed at a discrete timestep t • {qt} sastifies the Markov property: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt ) • The transition from qt to qt+1 is calculated from the transition probability matrix s1 s2 s3 1/2 1/2 s2 s1 1 28/03/2011 2/3 1/3 s1 s2 s3 Markov models 0 ½ 1/3 0 ½ 2/3 1 0 0 27 s3 Transition probabilities
Slide 28: Markov chain • So, the chain of {qt} is called Markov chain q0 q1 q2 q3 • Each qt takes value from the countable state-space {s1, s2, s3...} • Each qt is observed at a discrete timestep t • {qt} sastifies the Markov property: p ( qt +1 ˚ qt , qt −1 ,..., q0 ) = p ( qt +1 ˚ qt ) • The transition from qt to qt+1 is calculated from the transition probability matrix s1 s2 s3 1/2 1/2 s2 s1 1 28/03/2011 2/3 1/3 s1 s2 s3 Markov models 0 ½ 1/3 0 ½ 2/3 1 0 0 28 s3 Transition probabilities
Slide 29: Markov Chain – Important property • In a Markov chain, the joint distribution is p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 ) j =1 m 28/03/2011 Markov models 29
Slide 30: Markov Chain – Important property • In a Markov chain, the joint distribution is p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 ) j =1 m • Why? p ( q0 , q1 ,..., qm ) = p ( q0 ) ∏ p ( q j | q j −1 , previous states ) j =1 m m = p ( q0 ) ∏ p ( q j | q j −1 ) j =1 Due to the Markov property 28/03/2011 Markov models 30
Slide 31: Markov Chain: e.g. • The state-space of weather: rain wind cloud 28/03/2011 Markov models 31
Slide 32: Markov Chain: e.g. • The state-space of weather: 1/2 rain 1/2 1/3 2/3 wind Rain 1 Cloud Wind Rain ½ 1/3 0 Cloud 0 0 1 Wind ½ 2/3 0 cloud 28/03/2011 Markov models 32
Slide 33: Markov Chain: e.g. • The state-space of weather: 1/2 rain 1/2 1/3 2/3 wind Rain 1 Cloud Wind Rain ½ 1/3 0 Cloud 0 0 1 Wind ½ 2/3 0 cloud • Markov assumption: weather in the t+1’th day is depends only on the t’th day. 28/03/2011 Markov models 33
Slide 34: Markov Chain: e.g. • The state-space of weather: 1/2 rain 1/2 1/3 2/3 wind Rain 1 Cloud Wind Rain ½ 1/3 0 Cloud 0 0 1 Wind ½ 2/3 0 cloud • Markov assumption: weather in the t+1’th day is depends only on the t’th day. • We have observed the weather in a week: rain wind cloud rain wind Day: 28/03/2011 0 1 2 Markov models 3 4 34
Slide 35: Markov Chain: e.g. • The state-space of weather: 1/2 rain 1/2 1/3 2/3 wind Rain 1 Cloud Wind Rain ½ 1/3 0 Cloud 0 0 1 Wind ½ 2/3 0 cloud • Markov assumption: weather in the t+1’th day is depends only on the t’th day. • We have observed the weather in a week: rain wind cloud Markov Chain wind rain Day: 28/03/2011 0 1 2 Markov models 3 4 35
Slide 36: Contents • Introduction • Markov Chain • Hidden Markov Models – Independent assumptions – Formal definition – Forward algorithm – Viterbi algorithm – Baum-Welch algorithm • Markov Random Field 28/03/2011 Markov models 36
Slide 37: Modeling pairs of sequences • In many applications, we have to model pair of sequences • Examples: – POS tagging in Natural Language Processing (assign each word in a sentence to Noun, Adj, Verb...) – Speech recognition (map acoustic sequences to sequences of words) – Computational biology (recover gene boundaries in DNA sequences) – Video tracking (estimate the underlying model states from the observation sequences) – And many others... 28/03/2011 Markov models 37
Slide 38: Probabilistic models for sequence pairs • We have two sequences of random variables: X1, X2, ..., Xm and S1, S2, ..., Sm • Intuitively, in a pratical system, each Xi corresponds to an observation and each Si corresponds to a state that generated the observation. • Let each Si be in {1, 2, ..., k} and each Xi be in {1, 2, ..., o} • How do we model the joint distribution: p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., S m = sm ) 28/03/2011 Markov models 38
Slide 39: Hidden Markov Models (HMMs) • In HMMs, we assume that p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., Sm = sm ) = p ( S1 = s1 ) ∏ p ( S j = s j ˚ S j −1 = s j −1 ) ∏ p ( X j = x j ˚ S j = s j ) j =2 j =1 m m • This is often called Independence assumptions in HMMs • We are gonna prove it in the next slides 28/03/2011 Markov models 39
Slide 40: Independence Assumptions in HMMs [1] p ( ABC ) = p ( A | BC ) p ( BC ) = p ( A | BC ) p ( B ˚ C ) p ( C ) • By the chain rule, the following equality is exact: p ( X 1 = x1 ,..., X m = xm , S1 = s1 ,..., S m = sm ) = p ( S1 = s1 ,..., S m = sm ) × p ( X 1 = x1 ,..., X m = xm ˚ S1 = s1 ,..., S m = sm ) • Assumption 1: the state sequence forms a Markov chain p ( S1 = s1 ,..., S m = sm ) = p ( S1 = s1 ) ∏ p ( S j = s j ˚ S j −1 = s j −1 ) j =2 m 28/03/2011 Markov models 40
Slide 41: Independence Assumptions in HMMs [2] • By the chain rule, the following equality is exact: p ( X 1 = x1 ,..., X m = xm ˚ S1 = s1 ,..., S m = sm ) = ∏ p ( X j = x j ˚ S1 = s1 ,..., Sm = sm , X 1 = x1 ,..., X j −1 = x j −1 ) j =1 m • Assumption 2: each observation depends only on the underlying state p ( X j = x j ˚ S1 = s1 ,..., Sm = sm , X 1 = x1 ,..., X j −1 = x j −1 ) = p( X j = xj ˚ S j = sj ) • These two assumptions are often called independence assumptions in HMMs 28/03/2011 Markov models 41
Slide 42: The Model form for HMMs • The model takes the following form: p ( x1 ,.., xm , s1 ,..., sm ;θ ) = π ( s1 ) ∏ t ( s j ˚ s j −1 ) ∏ e ( x j ˚ s j ) j =2 j =1 m m • Parameters in the model: – Initial probabilities π ( s ) for s ∈ {1, 2,..., k } – Transition probabilities t ( s ˚ s′ ) for s, s ' ∈ {1, 2,..., k } – Emission probabilities e ( x ˚ s ) for s ∈ {1, 2,..., k } and x ∈ {1, 2,.., o} 28/03/2011 Markov models 42
Slide 43: 6 components of HMMs • Discrete timesteps: 1, 2, ... • Finite state space: {si} (N states) • Events {xi} (M events) t11 π1 start π2 t12 t21 e13 e31 e22 t23 t32 e23 π3 t31 • Vector of initial probabilities {πi} π Π = {πi } = { p(q1 = si) } • Matrix of transition probabilities T = {Tij} = { p(qt+1=sj|qt=si) } • Matrix of emission probabilities E = {Eij} = { p(ot=xj|qt=si) } s1 e11 s2 s3 e33 x1 x2 x3 The observations at continuous timesteps form an observation sequence {o1, o2, ..., ot}, where oi ∈ {x1, x2, ..., xM} 28/03/2011 Markov models 43
Slide 44: 6 components of HMMs • Discrete timesteps: 1, 2, ... • Finite state space: {si} (N states) • Events {xi} (M events) t11 π1 start π2 t12 t21 e13 e31 e22 t23 t32 e23 π3 t31 • Vector of initial probabilities {πi} π Π = {πi } = { p(q1 = si) } • Matrix of transition probabilities T = {Tij} = { p(qt+1=sj|qt=si) } • Matrix of emission probabilities E = {Eij} = { p(ot=xj|qt=si) } s1 e11 s2 s3 e33 x1 x2 x3 Constraints: N N M The observations at continuous timesteps form an observation sequence πi = 1 {o1, o2, ..., ot}, where oi ∈ {x1Tij 2=..., xM} Eij = 1 ,x , 1 ∑ i =1 ∑ j =1 ∑ j =1 28/03/2011 Markov models 44
Slide 45: 6 components of HMMs • Given a specific HMM and an observation sequence, the corresponding sequence of states is generally not deterministic • Example: Given the observation sequence: {x1, x3, x3, x2} The corresponding states can be any of following sequences: {s1, s2, s1, s2} {s1, s2, s3, s2} {s1, s1, s1, s2} ... 28/03/2011 Markov models 45 start π1 t11 π2 t12 t21 e13 e31 e22 t23 t32 e23 π3 t31 s1 e11 s2 s3 e33 x1 x2 x3
Slide 46: Here’s an HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 T s1 s2 s3 s1 0.5 0.4 0.2 s2 0.5 0 0.8 s3 0 0.6 0 E s1 s2 s3 x2 x1 0.3 0 0.2 x2 0 0.1 0 x3 x3 0.7 0.9 0.8 π s1 0.3 s2 0.3 s3 0.4 28/03/2011 Markov models 46
Slide 47: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 π T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3 x2 x3 • Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations: 0.3 - 0.3 - 0.4 randomply choice between S1, S2, S3 0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8 Markov models q1 q2 q3 o1 o2 o3 47 28/03/2011
Slide 48: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 π T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3 x2 x3 • Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations: 0.2 - 0.8 choice between X1 and X3 0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8 Markov models q1 q2 q3 S3 o1 o2 o3 48 28/03/2011
Slide 49: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 π T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3 x2 x3 • Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations: Go to S2 with probability 0.8 or S1 with prob. 0.2 0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8 Markov models q1 q2 q3 S3 o1 o2 o3 X3 28/03/2011 49
Slide 50: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 π T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3 x2 x3 • Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations: 0.3 - 0.7 choice between X1 and X3 0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8 Markov models q1 q2 q3 S3 S1 o1 o2 o3 X3 28/03/2011 50
Slide 51: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 π T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3 x2 x3 • Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations: Go to S2 with probability 0.5 or S1 with prob. 0.5 0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8 Markov models q1 q2 q3 S3 S1 o1 o2 o3 X3 X1 28/03/2011 51
Slide 52: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 π T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3 x2 x3 • Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations: 0.3 - 0.7 choice between X1 and X3 0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8 Markov models q1 q2 q3 S3 S1 S1 o1 o2 o3 X3 X1 28/03/2011 52
Slide 53: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 π T s1 s2 s3 s1 0.3 s1 0.5 0.4 0.2 s2 0.3 s2 0.5 0 0.8 s3 x2 x3 • Start randomly in state 1, 2 or 3. • Choose a output at each state in random. • Let’s generate a sequence of observations: We got a sequence of states and corresponding observations! 0.4 s3 0 0.6 0 E s1 s2 s3 x1 0.3 0 0.2 x2 0 0.1 0 x3 0.7 0.9 0.8 Markov models q1 q2 q3 S3 S1 S1 o1 o2 o3 X3 X1 X3 53 28/03/2011
Slide 54: Three famous HMM tasks • Given a HMM Φ = (T, E, π). Three famous HMM tasks are: • Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O) • Most likely expaination (inference) – Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O) • Learning the HMM – Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π) 28/03/2011 Markov models 54
Slide 55: Three famous HMM tasks • Given a HMM Φ = (T, E, π). Three famous HMM tasks are: • Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O) Calculating the probability of observing the sequence O over all of possible sequences. • Most likely expaination (inference) – Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O) • Learning the HMM – Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π) 28/03/2011 Markov models 55
Slide 56: Three famous HMM tasks • Given a HMM Φ = (T, E, π). Three famous HMM tasks are: • Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O) Calculating the best corresponding state sequence, given an observation sequence. • Most likely expaination (inference) – Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O) • Learning the HMM – Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π) 28/03/2011 Markov models 56
Slide 57: Three famous HMM tasks • Given a HMM Φ = (T, E, π). Three famous HMM tasks are: • Probability of an observation sequence (state estimation) – Given: Φ, observation O = {o1, o2,..., ot} – Goal: p(O|Φ), or equivalently p(st = Si|O) Given an (or a set of) observation sequence and corresponding state sequence, estimate the Transition matrix, Emission matrix and initial probabilities of the HMM • Most likely expaination (inference) – Given: Φ, the observation O = {o1, o2,..., ot} – Goal: Q* = argmaxQ p(Q|O) • Learning the HMM – Given: observation O = {o1, o2,..., ot} and corresponding state sequence – Goal: estimate parameters of the HMM Φ = (T, E, π) 28/03/2011 Markov models 57
Slide 58: Three famous HMM tasks Problem State estimation Calculating: p(O|Φ) Inference Calculating: Q*= argmaxQp(Q|O) Learning Calculating: Φ* = argmaxΦp(O|Φ) Baum-Welch (EM) O(TN2) Viterbi decoding O(TN2) Algorithm Forward Complexity O(TN2) T: number of timesteps N: number of states 28/03/2011 Markov models 58
Slide 59: State estimation problem • Given: Φ = (T, E, π), observation O = {o1, o2,..., ot} • Goal: What is p(o1o2...ot) ? • We can do this in a slow, stupid way – As shown in the next slide... 28/03/2011 Markov models 59
Slide 60: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 • What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? • Slow, stupid way: p (O ) = = x1 x2 x3 ∑ Q∈paths of length 3 p ( OQ ) p (O | Q ) p (Q ) ∑ Q∈ Q∈paths of length 3 • How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q? 28/03/2011 Markov models 60
Slide 61: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 • What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? • Slow, stupid way: p (O ) = = x1 π s1 0.3 s2 0.3 x2 s3 0.4 x3 ∑ Q∈paths of length 3 p ( OQ ) p (O | Q ) p (Q ) ∑ Q∈ Q∈paths of length 3 p(Q) = p(q1q2q3) = p(q1)p(q2|q1)p(q3|q2,q1) (chain) = p(q1)p(q2|q1)p(q3|q2) (why?) Example in the case Q=S3S1S1 P(Q) = 0.4 * 0.2 * 0.5 = 0.04 28/03/2011 • How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q? Markov models 61
Slide 62: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 • What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? • Slow, stupid way: p (O ) = = x1 π s1 0.3 s2 0.3 x2 s3 0.4 x3 ∑ Q∈paths of length 3 p ( OQ ) p (O | Q ) p (Q ) ∑ Q∈ Q∈paths of length 3 p(O|Q) = p(o1o2o3|q1q2q3) = p(o1|q1)p(o2|q1)p(o3|q3) (why?) Example in the case Q=S3S1S1 P(O|Q) = p(X3|S3)p(X1|S1) p(X3|S1) =0.8 * 0.3 * 0.7 = 0.168 28/03/2011 • How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q? Markov models 62
Slide 63: Here’s a HMM 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 • What is p(O) = p(o1o2o3) = p(o1=X3 ∧ o2=X1 ∧ o3=X3)? • Slow, stupid way: p (O ) = = x1 π s1 0.3 s2 0.3 x2 s3 0.4 x3 ∑ Q∈paths of length 3 p ( OQ ) p (O | Q ) p (Q ) ∑ Q∈ Q∈paths of length 3 p(O|Q) = p(o1o2o3|q1q2q3) p(O) needs 27 p(Q) = p(o1|q1)p(o2|q1)p(o3|q3) (why?) computations and 27 p(O|Q) computations. • How to compute p(Q) for an arbitrary path Q? • How to compute p(O|Q) for an arbitrary path Q? So let’s be smarter... 63 Example in the case Q=S3S1S1 What if the P(O|Q) = p(X3|S3)p(Xsequence3has ) 1|S1) p(X |S1 20 observations? =0.8 * 0.3 * 0.7 = 0.168 28/03/2011 Markov models
Slide 64: The Forward algorithm • Given observation o1o2...oT • Forward probabilities: αt(i) = p(o1o2...ot ∧ qt = si | Φ) where 1 ≤ t ≤ T αt(i) = probability that, in a random trial: – We’d have seen the first t observations – We’d have ended up in si as the t’th state visited. • In our example, what is α2(3) ? 28/03/2011 Markov models 64
Slide 65: αt(i): easy to define recursively α t ( i ) = p ( o1o2 ...ot ∧ qt = si | Φ ) α1 ( i ) = p ( o1 ∧ q1 = si ) = p ( q1 = si ) p ( o1 | q1 = si ) = π i Ei ( o1 ) Π = {π i } = { p ( q1 = si )} { E = {E } = { p ( o = x ij t T = {Tij } = p ( qt +1 = s j | qt = si ) j t i } | q = s )} α t +1 ( i ) = p ( o1o2 ...ot +1 ∧ qt +1 = si ) = ∑ p ( o1o2 ...ot ∧ qt = s j ∧ ot +1 ∧ qt +1 = si ) j =1 N = ∑ p ( ot +1 ∧ qt +1 = si | o1o2 ...ot ∧ qt = s j ) p ( o1o2 ...ot ∧ qt = s j ) j =1 N = ∑ p ( ot +1 ∧ qt +1 = si | qt = s j ) α t ( j ) j =1 N = ∑ p ( ot +1 | qt +1 = si ) p ( qt +1 = si | qt = s j ) α t ( j ) j =1 N = ∑T ji Ei ( ot +1 ) α t ( j ) j =1 N 28/03/2011 Markov models 65
Slide 66: In our example αt ( i ) = p ( o1o2 ...ot ∧ qt = si | Φ ) α1 ( i ) = Ei ( o1 ) π i αt +1 ( i ) = ∑Tji Ei ( ot +1 ) αt ( j ) = Ei ( ot +1 ) ∑Tjiαt ( j ) j j 0.5 0.2 s1 0.3 0.2 0.5 0.4 0.7 s2 0.1 0.6 0.8 0.9 s3 0.8 x1 x2 π s1 0.3 s2 0.3 x3 s3 0.4 We observed: α1(1) = 0.3 * 0.3 = 0.09 α1(2) = 0 α1(3) = 0.2 * 0.4 = 0.08 x1x2 α2(1) = 0 * (0.09*0 .5+ 0*0.4 + 0.08*0.2) = 0 α2(2) = 0.1 * (0.09*0.5 + 0*0 + 0.08*0.8) = 0.0109 α2(3) = 0 * (0.09*0 + 0*0.6 + 0.08*0) = 0 28/03/2011 Markov models 66
Slide 67: Forward probabilities - Trellis N s4 s3 s2 s1 1 28/03/2011 2 3 4 Markov models 5 6 T 67
Slide 68: Forward probabilities - Trellis N s4 α1 (4) s3 α1 (3) α2 (3) α6 (3) s2 α1 (2) α3 (2) α5 (2) s1 α1 (1) α4 (1) 1 28/03/2011 2 3 4 Markov models 5 6 T 68
Slide 69: Forward probabilities - Trellis N s4 α1 (4) α1 ( i ) = Ei ( o1 ) π i s3 α1 (3) α2 (3) s2 α1 (2) s1 α1 (1) 1 28/03/2011 2 3 4 Markov models 5 6 T 69
Slide 70: Forward probabilities - Trellis N s4 α1 (4) αt +1 ( i ) = Ei ( ot +1 ) ∑Tjiαt ( j ) j s3 α1 (3) α2 (3) s2 α1 (2) s1 α1 (1) 1 28/03/2011 2 3 4 Markov models 5 6 T 70
Slide 71: Forward probabilities • So, we can cheaply compute: αt ( i ) = p ( o1o2 ...ot ∧ qt = si ) p ( o1 o 2 ...o t ) • How can we cheaply compute: • How can we cheaply compute: p ( q t = s i | o1 o 2 ...o t ) 28/03/2011 Markov models 71
Slide 72: Forward probabilities • So, we can cheaply compute: αt ( i ) = p ( o1o2 ...ot ∧ qt = si ) • How can we cheaply compute: p ( o1 o 2 ...o t ) = • How can we cheaply compute: ∑ α (i ) t i αt ( i ) p ( q t = s i | o1 o 2 ...o t ) = ∑α t ( j ) j Look back the trellis... 28/03/2011 Markov models 72
Slide 73: State estimation problem • State estimation is solved: p ( O | Φ ) = p ( o1o2 … ot ) = ∑ α i ( i ) i =1 N • Can we utilize the elegant trellis to solve the Inference problem? – Given an observation sequence O, find the best state sequence Q Q = arg max p ( Q | O ) * Q 28/03/2011 Markov models 73
Slide 74: Inference problem • Given: Φ = (T, E, π), observation O = {o1, o2,..., ot} • Goal: Find Q * = arg max p ( Q | O ) Q = arg max p ( q1q2 … qt | o1o2 … ot ) q1q2 … qt • Practical problems: – Speech recognition: Given an utterance (sound), what is the best sentence (text) that matches the utterance? – Video tracking – POS Tagging 28/03/2011 Markov models s1 x1 s2 x2 s3 x3 74
Slide 75: Inference problem • We can do this in a slow, stupid way: Q * = arg max p ( Q | O ) Q = arg max Q p (O | Q ) p (Q ) p (O ) = arg max p ( O | Q ) p ( Q ) Q = arg max p ( o1o2 … ot | Q ) p ( Q ) Q • But it’s better if we can find another way to compute the most probability path (MPP)... 28/03/2011 Markov models 75
Slide 76: Efficient MPP computation • We are going to compute the following variables: δ t ( i ) = max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 …ot ) q1q2 …qt −1 • δt(i) is the probability of the best path of length t-1 which ends up in si and emits o1...ot. • Define: mppt(i) = that path so: 28/03/2011 δt(i) = p(mppt(i)) Markov models 76
Slide 77: Viterbi algorithm δ t ( i ) = max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 … ot ) q1q2 …qt −1 mppt ( i ) = arg max p ( q1q2 … qt −1 ∧ qt = si ∧ o1o2 …ot ) q1q2 …qt −1 δ1 ( i ) = max p ( q1 = si ∧ o1 ) one choice = π i Ei ( o1 ) = α1 ( i ) s4 s3 s2 s1 N δ (4) 1 δ 1 (3) δ 1 (2) δ 2 (3) δ 1 (1) 1 2 3 4 Markov models 5 6 T 77 28/03/2011
Slide 78: Viterbi algorithm time t time t + 1 • The most prob path with last two states sisj is the most path to si, followed by transition si sj. s1 si ... ... 28/03/2011 ... sj • The prob of that path will be: δt(i) × p(si sj ∧ ot+1) = δt(i)TijEj(ot+1) • So, the previous state at time t is: i* = arg max δ t ( i ) Tij E j ( ot +1 ) i Markov models 78
Slide 79: Viterbi algorithm • Summary: δ t +1 ( j ) = δ t ( i* ) Tij E j ( ot +1 ) mppt +1 ( j ) = mppt i* s j i* = arg max δ t ( i ) Tij E j ( ot +1 ) i δ1 ( i ) = π i Ei ( o1 ) = α1 ( i ) () s4 s3 s2 s1 N δ (4) 1 δ 1 (3) δ 1 (2) δ 2 (3) δ 1 (1) 1 2 3 4 Markov models 5 6 T 79 28/03/2011
Slide 80: What’s Viterbi used for? • Speech Recognition Chong, Jike and Yi, Youngmin and Faria, Arlo and Satish, Nadathur Rajagopalan and Keutzer, Kurt, “Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors”, EECS Department, University of California, Berkeley, 2008. 28/03/2011 Markov models 80
Slide 81: Training HMMs • Given: large sequence of observation o1o2...oT and number of states N. • Goal: Estimation of parameters Φ = ⟨T, E, π⟩ • That is, how to design an HMM. • We will infer the model from a large amount of data o1o2...oT with a big “T”. 28/03/2011 Markov models 81
Slide 82: Training HMMs • Remember, we have just computed p(o1o2...oT | Φ) • Now, we have some observations and we want to inference Φ from them. • So, we could use: – MAX LIKELIHOOD: – BAYES: Compute p ( Φ | o1 … oT ) then take E [ Φ ] or max p ( Φ | o1 … oT ) Φ Φ = arg max p ( o1 … oT | Φ ) Φ 28/03/2011 Markov models 82
Slide 83: Max likelihood for HMMs • Forward probability: the probability of producing o1...ot while ending up in state si αt ( i ) = p ( o1o2 ...ot ∧ qt = si ) α1 ( i ) = Ei ( o1 ) π i α t +1 ( i ) = Ei ( ot +1 ) ∑ T jiα t ( j ) j • Backward probability: the probability of producing ot+1...oT given that at time t, we are at state si βt ( i ) = p ( ot +1ot +2 ...oT | qt = si ) 28/03/2011 Markov models 83
Slide 84: Max likelihood for HMMs - Backward • Backward probability: easy to define recursively βt ( i ) = p ( ot +1ot +2 ...oT | qt = si ) βT ( i ) = 1 βt ( i ) = ∑ p ( ot +1 ∧ ot +2 ...oT ∧ qt +1 = s j | qt = si ) j =1 N βT ( i ) = 1 βt ( i ) = ∑ βt +1 ( j ) Tij E j ( ot +1 ) j =1 N = ∑ p ( ot +1 ∧ qt +1 = s j | qt = si ) p ( ot + 2 ...oT | ot +1 ∧ qt +1 = s j ∧ qt = si ) j =1 N = ∑ p ( ot +1 ∧ qt +1 = s j | qt = si ) p ( ot + 2 ...oT | qt +1 = s j ) j =1 N = ∑ βt +1 ( j ) Tij E j ( ot +1 ) j =1 28/03/2011 Markov models 84 N
Slide 85: Max likelihood for HMMs • The probability of traversing a certain arc at time t given o1o2...oT: ε ij ( t ) = p ( qt = si ∧ qt +1 = s j | o1o2 …oT ) = = p ( qt = si ∧ qt +1 = s j ∧ o1o2 …oT ) p ( o1o2 …oT ) p ( o1o2 … ot ∧ qt = si ) p ( qt = si ∧ qt +1 = s j ) p ( ot +1ot + 2 …oT | qt = si ) ∑ p ( o o …o ∧ q 12 t i =1 N t = si ) p ( ot +1ot + 2 … oT | qt = si ) ε ij ( t ) = α t ( i ) Tij β t ( i ) ∑α (i ) β (i ) t t i =1 Markov models 85 N 28/03/2011
Slide 86: Max likelihood for HMMs • The probability of being at state si at time t given o1o2...oT: γ i ( t ) = p ( qt = si | o1o2 …oT ) = ∑ p ( qt = si ∧ qt +1 = s j | o1o2 …oT ) j =1 N γ i ( t ) = ∑ ε ij ( t ) j =1 N 28/03/2011 Markov models 86
Slide 87: Max likelihood for HMMs • Sum over the time index: – Expected # of transitions from state i to j in o1o2...oT: ∑ ε (t ) ij t =1 T −1 – Expected # of transitions from state i in o1o2...oT : ∑ γ ( t ) = ∑∑ ε ( t ) = ∑ ∑ε ( t ) i ij ij t =1 t =1 j =1 j =1 t =1 28/03/2011 Markov models 87 T −1 T −1 N N T −1
Slide 88: Update parameters Π = {π i } = { p ( q1 = si )} { E = {E } = { p ( o = x ij t T = {Tij } = p ( qt +1 = s j | qt = si ) j t i } | q = s )} ˆ π i = expected frequency in state i at time t = 1 = γ i (1) expected # of transitions from state i to j Tij = = expected # of transitions from state i ∑ ε (t ) ij t =1 T −1 t =1 T −1 ∑ ε (t ) ij T −1 ∑ γ ( t ) ∑∑ ε ( t ) i ij j =1 t =1 = t =1 N T −1 expected # of transitions from state i with x k observed Eik = expected # of transitions from state i ∑ δ ( o , x ) γ ( t ) ∑∑ δ ( o , x ) ε ( t ) t k i t k ij T −1 t =1 N T −1 = ∑ γ (t ) i t =1 T −1 = j =1 t =1 N T −1 j =1 t =1 ∑∑ ε ( t ) ij Markov models 88 28/03/2011
Slide 89: The inner loop of Forward-Backward Given an input sequence. 1. Calculate forward probability: – Base case: α1 ( i ) = Ei ( o1 ) π i j – Recursive case: α t +1 ( i ) = Ei ( ot +1 ) ∑ T jiα t ( j ) 2. Calculate backward probability: – – Base case: βT ( i ) = 1 N j =1 Recursive case: βt ( i ) = ∑ βt +1 ( j ) Tij E j ( ot +1 ) 3. Calculate expected count: 4. Update parameters: T −1 ∑ ε ij ( t ) Tij = 28/03/2011 N T −1 ε ij ( t ) = α t ( i ) Tij βt ( i ) ∑α (i ) β (i ) t t i =1 N ∑∑ δ ( o , x ) ε ( t ) t k ij j =1 t =1 N T −1 j =1 t =1 ∑∑ ε ( t ) ij j =1 t =1 t =1 N T −1 Eik = ∑∑ ε ( t ) ij Markov models 89
Slide 90: Forward-Backward: EM for HMM • If we knew Φ we could estimate expectations of quantities such as – Expected number of times in state i – Expected number of transitions i j • If we knew the quantities such as – Expected number of times in state i – Expected number of transitions i j we could compute the max likelihood estimate of Φ = ⟨T, E, Π⟩ • Also known (for the HMM case) as the Baum-Welch algorithm. 28/03/2011 Markov models 90
Slide 91: EM for HMM • Each iteration provides values for all the parameters • The new model always improve the likeliness of the training data: ˆ p o1o2 …oT | Φ ≥ p ( o1o2 …oT | Φ ) • The algorithm does not guarantee to reach global maximum. ( ) 28/03/2011 Markov models 91
Slide 92: EM for HMM • Bad News – There are lots of local minima • Good News – The local minima are usually adequate models of the data. • Notice – EM does not estimate the number of states. That must be given (tradeoffs) – Often, HMMs are forced to have some links with zero probability. This is done by setting Tij = 0 in initial estimate Φ(0) – Easy extension of everything seen today: HMMs with real valued outputs 28/03/2011 Markov models 92
Slide 93: Contents • Introduction • Markov Chain • Hidden Markov Models • Markov Random Field (from the viewpoint of classification) 28/03/2011 Markov models 93
Slide 94: Example: Image segmentation • Observations: pixel values • Hidden variable: class of each pixel • It’s reasonable to think that there are some underlying relationships between neighbouring pixels... Can we use Markov models? • Errr.... the relationships are in 2D! 28/03/2011 Markov models 94
Slide 95: MRF as a 2D generalization of MC • Array of observations: • Classes/States: S = {sij } , X = { xij } , sij = 1...M 0 ≤ i < Nx , 0 ≤ j < N y • Our objective is classification: given the array of observations, estimate the corresponding values of the state array S so that p( X | S ) p(S ) is maximum. 28/03/2011 Markov models 95
Slide 96: 2D context-dependent classification • Assumptions: – The values of elements in S are mutually dependent. – The range of this dependence is limited within a neighborhood. • For each (i, j) element of S, a neighborhood Nij is defined so that – sij ∉ Nij: (i, j) element does not belong to its own set of neighbors. – sij ∈ Nkl ⇔ skl ∈ Nij: if sij is a neighbor of skl then skl is also a neighbor of sij 28/03/2011 Markov models 96
Slide 97: 2D context-dependent classification • The Markov property for 2D case: p sij | Sij = p ( sij | N ij ) where Sij includes all the elements of S except the (i, j) one. • The elegeant dynamic programing is not applicable: the problem is much harder now! ( ) 28/03/2011 Markov models 97
Slide 98: 2D context-dependent classification • The Markov property for 2D case: p sij | Sij = p ( sij | N ij ) where Sij includes all the elements of S except the (i, j) one. • ( ) We are gonna see an The elegeant dynamic programing is not applicable: the problem is application of MRF for much harder now! Image Segmentation and Restoration. 28/03/2011 Markov models 98
Slide 99: MRF for Image Segmentation • Cliques: a set of each pixel which are neighbors of each other (w.r.t the type of neighborhood) 28/03/2011 Markov models 99
Slide 100: MRF for Image Segmentation • Dual Lattice number • Line process: 28/03/2011 Markov models 100
Slide 101: MRF for Image Segmentation • Gibbs distribution: −U ( s ) 1 π ( s ) = exp Z T – Z: normalizing constant – T: parameter • It turns out that Gibbs distribution implies MRF ([Gema 84]) 28/03/2011 Markov models 101
Slide 102: MRF for Image Segmentation • A Gibbs conditional probability is of the form: 1 1  p ( sij | N ij ) = exp  − ∑ Fk ( Ck ( i, j ) )  Z Tk  – Ck(i, j): clique of the pixel (i, j) – Fk: some functions, e.g. 1 − sij α1 + α 2 ( si −1, j + si +1, j ) + α 2 ( si , j −1 + si , j +1 ) T ( ) 28/03/2011 Markov models 102
Slide 103: MRF for Image Segmentation • Then, the joint probability for the Gibbs model is  ∑∑ Fk ( Ck ( i, j ) )   i, j k  p ( S ) = exp  −  T     – The sum is calculated over all possible cliques associated with the neighborhood. • We also need to work out p(X|S) • Then p(X|S)p(S) can be maximized... [Gema 84] 28/03/2011 Markov models 103
Slide 104: More on Markov models... • MRF does not stop there... Here are some related models: – Conditional random field (CRF) – Graphical models – ... • Markov Chain and HMM does not stop there... – Markov chain of order m – Continuous-time Markov chains... – Real-value observations – ... 28/03/2011 Markov models 104
Slide 105: What you should know • Markov property, Markov Chain • HMM: – Defining and computing αt(i) – Viterbi algorithm – Outline of the EM algorithm for HMM • Markov Random Field – And an application in Image Segmentation – [Geman 84] for more information. 28/03/2011 Markov models 105
Slide 106: Q&A 28/03/2011 Markov models 106
Slide 107: References • L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition“, Proc. of the IEEE, Vol.77, No.2, pp.257--286, 1989. • • Andrew W. Moore, “Hidden Markov Models”, http://www.autonlab.org/tutorials/ Geman S., Geman D. “Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 6(6), pp. 721-741, 1984. 28/03/2011 Markov models 107

   
Time on Slide Time on Plick
Slides per Visit Slide Views Views by Location