Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods".

I am Ge Gao from the Center for Bioinformatics, Peking University.

Let's continue this week's topic: Markov Model.

Last time we introduced the concept of Markov Chain.

With the concept of “state”, we can distinguish between opening and extending a gap

and apply affine gap penalty correctly.

It is obvious, however, that these are not enough for real sequence alignment in practice.

The current state model

discriminates only between “gap state (X or Y)” and “match state (M)”,

but not between different residues.

Therefore, we need to introduce the Hidden Markov Model.

The Hidden Markov Model adds to the states in Markov Model the concept of Tokens.

Each state can emit a set of observable tokens with different probabilities.

In other words, aside from the transition probability,

the Hidden Markov Model has also introduced the concept of “emission probability”.

the Hidden Markov Model has also introduced the concept of “emission probability”.

which generates a set of observable tokens with different probabilities.

In contrast to Markov Model, the state path is not immediately visible in Hidden Markov Model.

That’s why it is “Hidden” Markov Model.

Rather, we need to deduce the state path by the corresponding tokens observed.

For example, the string “aabc”

can have its state path being “1,1,2,3”,or “1,2,3,3”, “1,3,3,3”, “1,1,1,1”, “2,2,2,2”, “3,3,3,3”, etc.

This is because each state is able to emit “a”, “b”, and “c”.

The problem is that the emission probabilities for “a”, “b”, and “c” are different between different states.

Thus, through different state paths we will get different emission probabilities for string "aabc".

Theoretically, we can enumerate all the possible state paths,right?

Then we can calculate a probability for each path,right?

The one whose probability is maximum is the most likely path.

For example, the probability of emitting the string “aabc” from the state path “1,2,3,3” is

the probability of emitting “a” from state 1 (which is 0.8),

multiplied by the probability of transiting from state 1 to state 2 (which is 0.3),

multiplied by the probability of emitting “a” from state 2 (which is 0.2)

multiplied by the probability of transiting from state 2 to state 3 (which is 0.5)

multiplied by the probability of transiting from state 3 to state 3 itself (which is 1),

multiplied by the probability of emitting “c” from state 3,(which is 0.1)

which ends up with 0.0072.

Actually, in this way we can compute the probability of generating the string “aabc” for any given state path.

The state path whose probability is the largest

will thus be the path most likely to generate this string.

For convenience , we will define the emission probability, denoted as e_k(b),

as the probability of emitting token b at the state S_k .

Then we can use the multiplication theorem on probability

to rewrite the probability of generating token strings Y from state path X

as the production of the emission probability e_k(b)

and the transition probability alpha_{k,l}.

Now that we have introduced the Hidden Markov Model,

let’s come back to the sequence alignment problem with affine gap penalty.

As discussed earlier, the Markov Chain model can distinguish between opening and extending a gap

by introducing the “gap states (X, Y)” and the “match state (M)”

However, it does not consider the difference between residues.

Specifically, we will use the emission probability to handle the differences between residues.

The tokens generated from state M are all possible substitutions of residues.

The corresponding emission probability is p_{a,b}.

The tokens generated from state X and Y are all possible insertions of residues.

The corresponding emission probability is q_a.

In this way we can simultaneously deal with the state and the specific residue easily.

The sequence alignment problem can thus be reformulated as a problem where we need to find the state path that

most likely to generate the given token string under the given Hidden Markov Model.

Specifically, we can use the idea of dynamic programming to solve it step by step.

We define "PM(i,j)" as the largest probability

[that can be generated from the alignment of] the subsequence of sequence X

from the first base to the i-th base and the subsequence of sequence Y

from the first base to the j-th base, given Xi aligned to Yj.

We also define "PX(i,j)" (or “PY(i,j)”) as the largest probability

[that can be generated from the alignment of] the subsequence of sequence X

from the first base to the i-th base and the subsequence of sequence Y

from the first base to the j-th base, given Xi (or Yj) aligned to a gap.

Thus we can define the recursive function at each step based on the state diagram.

The final alignment can also be recovered by the backtracking method in dynamic programming.

Let’s summarize.

In this unit, we have introduced the Hidden Markov Model

and used it to solve the sequence alignment problem with affine gap penalty.

Students who have watched the supplementary learning material may have a question now:

as we have already solved sequence alignment in Week 2

by simply introducing states M, X, and Y,

why still taking the time and effort to spend two units

introducing the Hidden Markov Model and other various new concepts

How could they still benefit us [if we have already had the old good methods]?

First, the introduction of Hidden Markov Model

gives a probabilistic interpretation of sequence alignment.

For example, the delta here can be regarded as the probability of developing an DNA insertion or deletion

during the evolutionary process.

In other words, the delta can also be regarded as the probability of developing a insertion or deletion

during the evolutionary process.

Another example is that the emission probabilities of state M

correspond straightforward to the frequency of various substitutions during the evolutionary process.

Also, with a probabilistic model we can leverage the Probability Theory to do more analyses.

For example, we can compute the identity that is

most likely for two given sequences to have, even without generating any alignment.

In fact, we have mentioned at the beginning of this unit

that the same sequence observed can come from many different state paths in HMM.

Therefore, we can get the full probability for the observed sequence by summing up

the probabilities for generating such a sequence over all possible state paths.

In the sense of sequence alignment itself,

[this full probability] is the sum of the probabilities over all possible alignments.

Or in other words, the identity that is most likely for two given sequences to have.

In practice it is very simple to calculate.

In fact, we only need to replace the maximum function with summation

at each recursion step and the final step.

This method does not depend on specific alignments,

which is especially important to cases mentioned before where multiple optimal alignments exist.

Last but not least,

we will discuss about the feature of deducing hidden states

from observed token sequence in Hidden Markov Model.

In fact, Hidden Markov Model is used more as a predictor in modern bioinformatics research.

We will introduce this in next unit.

Here are some summary questions

You are encouraged to think about them and discuss them with other students and TAs in the forum.

Thank you. Let's meet at the next unit!